Open source · MIT · pip install scrapesome

Scrape any page.
Get clean data out.

A lightweight Python library with sync & async support, automatic Playwright fallback for JS-heavy pages, and built-in output formatting. Zero boilerplate.

$ pip install scrapesome

example.py

from scrapesome import sync_scraper

# returns clean Markdown, no setup needed
content = sync_scraper(
  "https://example.com",
  output_format_type="markdown",
  timeout=10,
  force_playwright=False,
  allow_redirects=True
)
print(content)

2

Fetch engines (HTTP + Playwright)

4

Output formats

0

Required config to get started

MIT

Open-source license

Core features

Robust scraping.
Minimal setup.

Handles the edge cases — retries, JavaScript, timeouts, redirects — so you don't have to wire them up yourself.

Sync & Async support

Both sync_scraper and async_scraper are built in. Use whichever fits your stack, with no wrappers needed.

Playwright fallback

Automatically falls back to a headless Chromium browser when plain HTTP gets a 403 or returns blank JS-rendered content.

Smart retries

Built-in retry logic with intelligent fallbacks. Transient failures don't crash your pipeline.

4 output formats

Get back HTML, plain text, Markdown, or JSON from any page — selected per-request, zero extra parsing.

Custom headers & agents

Pass any HTTP header or custom User-Agent list. Control redirect following and timeouts per request.

File saving & .env config

Save scraped output directly to disk. Configure defaults via .env — timeouts, log level, user agents, format.
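
The page does not show how the built-in retries are implemented, so the following is only a minimal sketch of the general idea behind retrying transient failures: call a fetch function a few times and wait longer after each failure. The names with_retries, attempts, and base_delay are hypothetical and are not part of the scrapesome API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, doubling the wait after each failure.

    Hypothetical helper for illustration; scrapesome's own retry logic may differ.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The sleep function is injectable so the backoff schedule can be checked in tests without actually waiting.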

How it works

Three steps to clean data

scrapesome picks the fastest path automatically — fallback only fires when needed.

Step 01

Call the scraper

Import sync_scraper or async_scraper and pass a URL. That's the minimum. All other options are optional.

Input: URL + optional params

Step 02

Auto-select engine

scrapesome tries a fast HTTP fetch first. If it gets a 403 or a blank body, or detects JS-rendered content, it retries with Playwright. You can also skip the HTTP attempt by passing force_playwright=True.

Auto-detects: HTTP or Playwright

Step 03

Receive formatted output

Get back clean HTML, text, Markdown, or JSON — ready to store, search, embed into an LLM pipeline, or save to a file.

Output: HTML · Text · Markdown · JSON
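
The engine selection described in Step 02 can be pictured roughly as follows. This is a sketch under stated assumptions, not scrapesome's actual code: Response, needs_browser, and fetch_page are hypothetical names, the two fetchers are passed in as plain callables, and the heuristic covers only the 403 and blank-body cases mentioned above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    """Hypothetical stand-in for an HTTP response."""
    status: int
    body: str


def needs_browser(resp: Response) -> bool:
    """Heuristic: a 403 or an empty body suggests the page needs a real browser."""
    return resp.status == 403 or not resp.body.strip()


def fetch_page(
    url: str,
    http_fetch: Callable[[str], Response],
    browser_fetch: Callable[[str], str],
    force_browser: bool = False,
) -> str:
    """Try the cheap HTTP path first and fall back to the browser only when needed."""
    if force_browser:
        # Mirrors the idea of force_playwright=True: skip HTTP entirely.
        return browser_fetch(url)
    resp = http_fetch(url)
    if needs_browser(resp):
        return browser_fetch(url)
    return resp.body
```

Keeping the decision in a small pure function like needs_browser makes the fallback rule easy to test and to extend, for example with a check for pages whose body is only a script loader.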

Output formats

Data exactly how you need it

Choose per request with the output_format_type parameter. Switch freely between formats.

HTML

Full raw markup — ideal for DOM parsing, archiving, or feeding into an HTML parser

JSON

Structured key-value output — title, description, URL — perfect for APIs and databases

Markdown

Readable structured text — great for LLM context, documentation, or readable exports

Text

Stripped plain text — minimal noise, ideal for search indexing, NLP, or summarisation
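
To make the difference between the formats concrete, here is a small standard-library sketch that derives the text and JSON views from raw HTML. It is an illustration only: format_page and _PageParser are hypothetical names, the JSON keys follow the title, description, and URL fields listed above, and Markdown is left out because it needs a dedicated converter. scrapesome's real output may be shaped differently.

```python
import json
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects the title, meta description, and visible text of a page."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            if attr.get("name") == "description":
                self.description = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def format_page(html, url, output_format="html"):
    """Return the page as raw HTML, plain text, or a JSON string."""
    if output_format == "html":
        return html
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    if output_format == "text":
        return "\n".join(parser.text_parts)
    if output_format == "json":
        return json.dumps(
            {"title": parser.title.strip(), "description": parser.description, "url": url}
        )
    raise ValueError(f"unsupported format in this sketch: {output_format}")
```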

Usage

Sync and async, your choice

Both modes are first-class citizens. Drop into an async pipeline or use the synchronous API for scripts — same options, same output.

Force Playwright for JS-rendered SPAs
Set custom User-Agent strings
Toggle redirect following
Save output directly to file
Configure defaults via .env

Synchronous

from scrapesome import sync_scraper

# basic
html = sync_scraper("https://example.com")

# with options
result = sync_scraper(
  "https://example.com",
  output_format_type="json",
  force_playwright=True,
  allow_redirects=True,
  user_agents=["Mozilla/5.0"]
)

# save to file
result = sync_scraper(
  "https://example.com",
  output_format_type="markdown",
  save_to_file=True,
  file_name="output"
)
print(result["file"])  # output.md

Asynchronous

import asyncio
from scrapesome import async_scraper

html = asyncio.run(
  async_scraper("https://example.com")
)

# or inside an async function
async def scrape():
  content = await async_scraper(
    "https://example.com",
    output_format_type="markdown",
    force_playwright=False
  )
  return content
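
One common reason to reach for the async API is scraping several URLs at once. The sketch below shows one way to do that with asyncio.gather and a semaphore that caps concurrency. scrape_many and max_concurrent are hypothetical names, and the scraper is passed in as a coroutine function so the helper does not depend on any particular library.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Sequence


async def scrape_many(
    urls: Sequence[str],
    scrape: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,
) -> Dict[str, str]:
    """Run `scrape` over all URLs, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        # The semaphore keeps a long URL list from opening too many connections.
        async with semaphore:
            return await scrape(url)

    # gather preserves input order, so results line up with urls.
    results = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(zip(urls, results))
```

Assuming async_scraper accepts the arguments shown above, it could presumably be plugged in as scrape_many(urls, lambda u: async_scraper(u, output_format_type="markdown")).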

CLI

Scrape from the command line too

Install with CLI extras and use scrapesome scrape from any terminal. Same options as the Python API.

$ pip install "scrapesome[cli]"

--url               Target URL to scrape
--output-format     html · json · markdown · text
--force-playwright  Force JS rendering via Playwright
--async-mode        Use asynchronous scraping
--save-to-file, -s  Save output to file
--file-name, -n     Base filename (extension auto-added)

Terminal

# basic scrape
scrapesome scrape --url https://example.com

# markdown output
scrapesome scrape \
  --url https://example.com \
  --output-format markdown

# save to file
scrapesome scrape \
  --url https://example.com \
  --output-format markdown \
  --save-to-file \
  --file-name output
# → saves output.md

# force JS rendering + async
scrapesome scrape \
  --url https://example.com \
  --force-playwright \
  --async-mode \
  --output-format json
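
If the CLI needs to be driven from a Python script, for instance in a cron job, the commands above can be assembled as an argument list. This is a sketch that uses only the flags listed on this page; build_scrape_command is a hypothetical helper, not something scrapesome ships.

```python
from typing import List, Optional

FORMATS = {"html", "json", "markdown", "text"}


def build_scrape_command(
    url: str,
    output_format: str = "html",
    force_playwright: bool = False,
    async_mode: bool = False,
    file_name: Optional[str] = None,
) -> List[str]:
    """Build the argv list for a `scrapesome scrape` invocation."""
    if output_format not in FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    cmd = ["scrapesome", "scrape", "--url", url, "--output-format", output_format]
    if force_playwright:
        cmd.append("--force-playwright")
    if async_mode:
        cmd.append("--async-mode")
    if file_name is not None:
        # Saving is switched on only when a base filename is given.
        cmd += ["--save-to-file", "--file-name", file_name]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True). Passing a list rather than a shell string avoids quoting problems with URLs that contain characters such as & or ?.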

Why scrapesome

How it compares

Most alternatives require you to wire up retries, fallbacks, and output parsing yourself. scrapesome ships all of that.

Feature	scrapesome	Playwright (py)	Requests-HTML	Scrapy
JS rendering	✓ Auto fallback	✓ Always	~ Partial	~ Plugin
Auto 403/blank fallback	✓ Seamless	✗ Manual	✗ No	✗ Manual
Sync + Async	✓ Built-in	✓ Sync and async APIs	~ Separate async session	✗ Async only
JSON / Markdown / Text output	✓ Built-in	✗ Manual	✗ Basic	✗ Pipeline
Retries + smart defaults	✓ Built-in	✗ Manual	✗ Limited	~ Settings
pip install & go	✓ < 1 min	✗ Browser setup	✓ Yes	✗ Project setup
.env configuration	✓ Optional	✗ Code only	✗ Hardcoded	~ Project

Installation

Up and running in under a minute

The core library works out of the box. Playwright is only needed if you want JavaScript rendering.

1

Install the library

Run pip install scrapesome. That's enough to start scraping static pages immediately.

2

Install Playwright (optional)

For JS-rendered pages, run pip install playwright, then playwright install to download the browsers.

3

System deps on Linux

Run playwright install-deps on Ubuntu/Debian to install required system libraries for Playwright.

4

Optional: .env config

Create a .env file to set LOG_LEVEL, OUTPUT_FORMAT, FETCH_PAGE_TIMEOUT, and USER_AGENTS as defaults.
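
For a sense of how such defaults are typically consumed, the sketch below reads the four variables named above from the environment. Everything beyond the variable names is an assumption: the Settings class and load_settings function are hypothetical, the fallback values are guesses, and USER_AGENTS is assumed here to be a |-separated list (a comma would clash with the commas that appear inside many real User-Agent strings). Check the documentation for the format scrapesome actually expects.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class Settings:
    """Hypothetical container for the defaults read from the environment."""
    log_level: str
    output_format: str
    fetch_page_timeout: int
    user_agents: List[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read defaults from environment variables, falling back to assumed values."""
    raw_agents = env.get("USER_AGENTS", "")
    return Settings(
        log_level=env.get("LOG_LEVEL", "INFO"),           # assumed default
        output_format=env.get("OUTPUT_FORMAT", "html"),   # assumed default
        fetch_page_timeout=int(env.get("FETCH_PAGE_TIMEOUT", "10")),  # seconds, assumed
        # Assumed separator: "|" between User-Agent strings.
        user_agents=[ua.strip() for ua in raw_agents.split("|") if ua.strip()],
    )
```

Values in a .env file only reach os.environ once something loads them, for example the python-dotenv package or the shell that starts the process.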

About scrapesome

scrapesome is a lightweight, flexible Python web scraping library with both synchronous and asynchronous support. It handles the common pain points — JavaScript rendering, retries, redirects, timeouts, and output formatting — so you can focus on the data, not the plumbing.

The library uses plain HTTP requests first for speed. When a page returns a 403 or blank JS-rendered body, it automatically falls back to Playwright (headless Chromium). You can also force Playwright directly for SPAs.

Output can be returned as raw HTML, structured JSON, Markdown, or plain text — selected per request. A CLI is available for one-off scrapes from the terminal, and all defaults can be configured via a .env file.

scrapesome is MIT licensed and built for developers who need robust scraping utilities with minimal setup. Contributions, bug reports, and feature requests are welcome on GitHub.
