

Sigil Bot

An autonomous scanner that monitors PyPI, npm, ClawHub, and GitHub for new and updated packages, scans them with all eight Sigil phases, and publishes results to the public scan database. Runs 24/7 — no human input required.

What Sigil Bot does

Sigil Bot watches public package registries for newly published and updated packages. When a new package appears, the bot downloads it into quarantine, runs all eight scan phases, stores the results, and publishes a report page at sigilsec.ai/scans.

                ┌──────────────────────┐
                │     SIGIL BOT        │
                │                      │
                │  Monitors registries │
                │  Downloads packages  │
                │  Runs Sigil scans    │
                │  Stores results      │
                └──────────┬───────────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
          ▼                ▼                ▼
   ┌────────────┐  ┌────────────┐  ┌────────────┐
   │ Scan DB    │  │ Badges     │  │ Threat     │
   │ /scans/*   │  │ /badge/*   │  │ Feed       │
   │ pages      │  │ SVGs       │  │ RSS + API  │
   └────────────┘  └────────────┘  └────────────┘

Public scan database

Every scanned package gets a report page. Each page is an SEO surface that AI models and search engines can cite.

Real-time threat feed

New scans are published as they happen via an RSS feed, an API endpoint, and alerts for HIGH RISK and CRITICAL RISK findings.

Badge generation

Automatically generates and caches SVG badges for every scanned package. Badges update when packages are rescanned.

Downstream integrations

The GitHub App, MCP server, and CLI threat intelligence all consume scan data produced by the bot.

Monitored registries

Four registries are monitored continuously. Each has a dedicated watcher process with optimised polling for that registry's API.

PyPI

Polls every 5 min

RSS feeds for new packages and version updates, plus the changelog serial API for incremental event tracking. Packages are downloaded via pip download --no-deps — no code is installed or executed.

Feed:    pypi.org/rss/packages.xml + changelog serial API
Scope:   AI ecosystem packages (langchain, openai, anthropic, mcp, agent, etc.)
Volume:  ~200–400 relevant packages/day
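A minimal sketch of how a watcher might extract project names from the packages RSS feed. The sample XML below is illustrative, not a real feed payload:

```python
import xml.etree.ElementTree as ET

# Abridged, illustrative sample of the pypi.org/rss/packages.xml format.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>PyPI recent packages</title>
    <item>
      <title>langchain-widgets added to PyPI</title>
      <link>https://pypi.org/project/langchain-widgets/</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def new_packages(rss_text: str) -> list[str]:
    """Extract the project name from each <item> link in the RSS feed."""
    root = ET.fromstring(rss_text)
    names = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        # Links look like https://pypi.org/project/<name>/
        names.append(link.rstrip("/").rsplit("/", 1)[-1])
    return names

print(new_packages(SAMPLE_RSS))  # ['langchain-widgets']
```

The changelog serial API covers the gap RSS misses (version updates between polls); the same name-extraction step applies to both sources.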

npm

Polls every 60 sec

CouchDB _changes stream from the npm registry replicate. Packages in @langchain/*, @anthropic/*, @openai/*, and @modelcontextprotocol/* scopes are scanned regardless of keyword matches.

Feed:    replicate.npmjs.com/registry/_changes
Scope:   AI ecosystem packages + all MCP-related scopes
Volume:  ~300–600 relevant packages/day
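The scope rule can be sketched as a filter over `_changes` entries. This is a simplification (the real filter also inspects descriptions and keywords, per the filtering section later in this page):

```python
import json

# Scopes scanned unconditionally, regardless of keyword matches.
ALWAYS_SCAN_SCOPES = ("@langchain/", "@anthropic/", "@openai/", "@modelcontextprotocol/")

def is_relevant(change_line: str, keywords: set[str]) -> bool:
    """Decide whether a _changes entry should be enqueued for scanning."""
    doc = json.loads(change_line)
    name = doc.get("id", "")
    if name.startswith(ALWAYS_SCAN_SCOPES):
        return True  # MCP/AI scopes bypass keyword filtering entirely
    return any(kw in name for kw in keywords)

print(is_relevant('{"id": "@modelcontextprotocol/server-foo"}', set()))  # True
print(is_relevant('{"id": "left-pad"}', {"mcp", "langchain"}))           # False
```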

ClawHub

Polls every 6 hours

REST API paginated by update time. All skills are scanned — no keyword filtering needed. The entire registry is relevant because every skill has direct access to the user's environment.

Feed:    clawhub.ai/api/v1/skills?sort=updated
Scope:   All skills (no filtering)
Volume:  ~50–100 new/updated skills per day
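A sketch of walking the updated-sorted pagination until results fall behind the last sweep. The response shape (`skills`, `updated_at`, `has_next`) is an assumption; `fetch_page` stands in for an HTTP GET of the endpoint above:

```python
from typing import Callable, Iterator

def iter_updated_skills(fetch_page: Callable[[int], dict], since: str) -> Iterator[dict]:
    """Yield skills updated after `since`, newest-first, stopping early."""
    page = 1
    while True:
        body = fetch_page(page)
        for skill in body.get("skills", []):
            if skill["updated_at"] <= since:
                return  # sorted newest-first, so nothing older is relevant
            yield skill
        if not body.get("has_next"):
            return
        page += 1

# Stubbed two-page response for illustration.
pages = {
    1: {"skills": [{"name": "a", "updated_at": "2024-02-02"},
                   {"name": "b", "updated_at": "2024-02-01"}], "has_next": True},
    2: {"skills": [{"name": "c", "updated_at": "2024-01-01"}], "has_next": False},
}
print([s["name"] for s in iter_updated_skills(pages.__getitem__, "2024-01-15")])  # ['a', 'b']
```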

GitHub (MCP Servers)

Sweeps every 12 hours

GitHub Search API for repositories matching MCP server patterns, plus the Events API for push events to known repos between sweeps. Repositories are cloned with git clone --depth 1 into quarantine.

Feed:    api.github.com/search/repositories + /events
Scope:   MCP server repos (>0 stars or >1 commit)
Volume:  ~20–50 new/updated repos per day
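Between sweeps, the Events API step reduces to picking out push events for repos already in the index. A minimal sketch (repo names are made up; the `type`/`repo.name` fields match the Events API payload shape):

```python
def pushes_to_known_repos(events: list[dict], known: set[str]) -> set[str]:
    """From an Events API page, pick tracked repos that received pushes."""
    return {
        e["repo"]["name"]
        for e in events
        if e.get("type") == "PushEvent" and e["repo"]["name"] in known
    }

events = [
    {"type": "PushEvent", "repo": {"name": "acme/mcp-server-demo"}},
    {"type": "WatchEvent", "repo": {"name": "acme/mcp-server-demo"}},
    {"type": "PushEvent", "repo": {"name": "other/unrelated"}},
]
print(pushes_to_known_repos(events, {"acme/mcp-server-demo"}))
# {'acme/mcp-server-demo'}
```

Each matching repo is then re-cloned with `git clone --depth 1` into quarantine and rescanned.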

Scan pipeline

Every scan follows the same five-stage pipeline: watch, queue, scan, store, publish.

WATCHER ──▶ QUEUE ──▶ SCANNER ──▶ STORE ──▶ PUBLISHER
 Poll feeds   Redis     Download    Postgres   Report page
 Deduplicate  Priority  Extract     Findings   Badge cache
 Filter       Retry     Sigil scan  Metadata   RSS feed
 Enqueue      Backoff   All phases             Alerts

Deduplication

Key: {ecosystem}:{name}:{version}:{content_hash}. If the exact same content has been scanned, it's skipped. If the version is the same but the content hash differs (re-upload), it's rescanned.
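A sketch of how the key might be derived; the hash algorithm and truncation length are assumptions, not documented behavior:

```python
import hashlib

def dedup_key(ecosystem: str, name: str, version: str, archive_bytes: bytes) -> str:
    """Build the {ecosystem}:{name}:{version}:{content_hash} dedup key."""
    content_hash = hashlib.sha256(archive_bytes).hexdigest()[:16]  # truncation assumed
    return f"{ecosystem}:{name}:{version}:{content_hash}"

k1 = dedup_key("pypi", "langchain", "0.3.1", b"original tarball")
k2 = dedup_key("pypi", "langchain", "0.3.1", b"silently re-uploaded tarball")
print(k1 != k2)  # True: same version, different content, so the package is rescanned
```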

Priority levels

Priority   SLA         Criteria
critical   Immediate   Typosquatting patterns, suspicious new publisher names
high       5 min       MCP scopes, ClawHub skills, popular packages with new versions
normal     30 min      Everything else matching AI keyword filters
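The triage logic can be sketched as a simple cascade. The attribute names (`is_typosquat`, `mcp_scope`, and so on) are illustrative, not the bot's actual schema:

```python
def priority(pkg: dict) -> str:
    """Map package attributes to a queue priority tier."""
    if pkg.get("is_typosquat") or pkg.get("suspicious_publisher"):
        return "critical"   # immediate
    if pkg.get("ecosystem") == "clawhub" or pkg.get("mcp_scope") or pkg.get("popular"):
        return "high"       # 5 min SLA
    return "normal"         # 30 min SLA

print(priority({"is_typosquat": True}))    # critical
print(priority({"ecosystem": "clawhub"}))  # high
print(priority({"ecosystem": "pypi"}))     # normal
```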

Scan isolation

Each scan runs in a fresh temporary directory. No network access during the scan — Sigil is static analysis only. No code is installed or executed. The quarantine directory is destroyed after scanning.
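A minimal sketch of the isolation contract using a context-managed temp directory, which guarantees cleanup even if the scan raises. The function names are illustrative:

```python
import tempfile
from pathlib import Path

def scan_in_quarantine(archive: bytes, scan) -> dict:
    """Run a scan in a fresh temp dir; the directory is destroyed on exit."""
    with tempfile.TemporaryDirectory(prefix="sigil-quarantine-") as tmp:
        target = Path(tmp) / "package.tar.gz"
        target.write_bytes(archive)  # extraction step omitted for brevity
        return scan(target)          # static analysis only; nothing is executed

# Demo: the scan callback sees the file; the directory is gone afterwards.
result = scan_in_quarantine(b"fake archive", lambda p: {"path_existed": p.exists()})
print(result)  # {'path_existed': True}
```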

Typosquatting detection

New packages with names within edit distance 2 of popular AI packages are automatically boosted to critical priority. This catches common squatting patterns before developers encounter them.

text
Target packages monitored for typosquats:
  langchain, openai, anthropic, transformers,
  huggingface, crewai, autogen, llamaindex,
  pinecone, chromadb, fastapi, streamlit

Detection patterns:
  Character substitution: langch4in, openal
  Character insertion:    langchainn, openaai
  Character deletion:     langchai, opena
  Transposition:          langchian, openia

Flagged packages receive an additional finding in the Provenance phase noting the name similarity.
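All four pattern classes above fall out of a plain Levenshtein check (a transposition counts as two single-character edits, still within the threshold). A sketch against a shortened target list:

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert/delete/substitute, each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

TARGETS = ["langchain", "openai", "anthropic", "transformers"]  # abridged list

def squat_targets(name: str) -> list[str]:
    """Popular names within edit distance 2 of a new package name.
    Distance 0 is excluded: the legitimate package is not its own squat."""
    return [t for t in TARGETS if 0 < edit_distance(name, t) <= 2]

print(squat_targets("langch4in"))  # ['langchain'] (substitution, distance 1)
print(squat_targets("openia"))     # ['openai'] (transposition, distance 2)
```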

Threat feed

Scan results are published to multiple output channels for downstream consumption.

RSS feed

Standard RSS 2.0 feed at sigilsec.ai/feed.xml. Contains the latest 100 scan results. Supports filtered variants:

text
All scans:      sigilsec.ai/feed.xml
Threats only:   sigilsec.ai/feed.xml?verdict=high_risk,critical_risk
ClawHub only:   sigilsec.ai/feed.xml?ecosystem=clawhub
PyPI only:      sigilsec.ai/feed.xml?ecosystem=pypi
npm only:       sigilsec.ai/feed.xml?ecosystem=npm

API endpoint

bash
GET /api/v1/feed?ecosystem={eco}&verdict={v}&limit={n}&since={iso_datetime}

JSON array of recent scans. Same filtering as RSS. This is what the MCP server queries, the GitHub App looks up, and third-party integrations consume.
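Building the query URL is straightforward; a sketch using only the parameters documented above:

```python
from urllib.parse import urlencode

def feed_url(base: str = "https://sigilsec.ai", **filters) -> str:
    """Build a feed query URL, dropping unset parameters."""
    params = {k: v for k, v in filters.items() if v is not None}
    return f"{base}/api/v1/feed" + ("?" + urlencode(params) if params else "")

url = feed_url(ecosystem="npm", verdict="critical_risk", limit=20)
print(url)
# https://sigilsec.ai/api/v1/feed?ecosystem=npm&verdict=critical_risk&limit=20
```

Polling clients would typically pass `since` with the timestamp of their last successful fetch to receive only new scans.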

Alerts

HIGH RISK and CRITICAL RISK findings trigger alerts to subscribed webhook endpoints. Only findings with a risk score of 25 or above generate alerts.
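The threshold rule amounts to a one-line filter over findings (the field name `risk_score` is assumed from context):

```python
ALERT_THRESHOLD = 25  # minimum risk score that triggers a webhook alert

def alertable(findings: list[dict]) -> list[dict]:
    """Findings that should be pushed to subscribed webhook endpoints."""
    return [f for f in findings if f["risk_score"] >= ALERT_THRESHOLD]

findings = [{"id": "F1", "risk_score": 40}, {"id": "F2", "risk_score": 12}]
print([f["id"] for f in alertable(findings)])  # ['F1']
```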

Scan attestations

Every scan produced by Sigil Bot is cryptographically signed and recorded in a public transparency log. This lets anyone verify that a scan result is genuine and untampered.

Ed25519 signatures

Each scan is wrapped in a DSSE envelope and signed with an Ed25519 key. The public key is published at /.well-known/sigil-verify.json.
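Per the DSSE specification, what the key actually signs is not the raw payload but its Pre-Authentication Encoding (PAE), which binds the payload to its type. A sketch of the encoding (the payload contents are illustrative; real attestations carry in-toto Statements):

```python
def dsse_pae(payload_type: str, payload: bytes) -> bytes:
    """Pre-Authentication Encoding from the DSSE spec: the exact byte
    string the Ed25519 key signs ('DSSEv1 <len> <type> <len> <payload>')."""
    t = payload_type.encode()
    return b"DSSEv1 %d %s %d %s" % (len(t), t, len(payload), payload)

pae = dsse_pae("application/vnd.in-toto+json", b'{"scan_id": "demo"}')
print(pae)
# b'DSSEv1 28 application/vnd.in-toto+json 19 {"scan_id": "demo"}'
```

Verifiers recompute the PAE from the envelope's `payloadType` and `payload` fields and check the signature against the published public key.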

in-toto attestations

Attestations follow the in-toto Statement v1 format with a custom predicate type for Sigil scan results.

Transparency log

Signed attestations are recorded in the Sigstore Rekor transparency log. Each scan report links to its log entry.

Verification API

Verify any scan via GET /api/v1/verify?scan_id=... or fetch the raw attestation from GET /api/v1/attestation/{id}.

For full verification steps, public keys, and SDK usage, see the Attestation docs.

AI ecosystem filtering

The bot doesn't scan every package on PyPI and npm — it targets the AI agent supply chain. Packages are matched if their name, description, or keywords contain any of these terms:

text
Frameworks:    langchain, crewai, autogen, llamaindex, haystack, dspy
LLM providers: openai, anthropic, cohere, mistral, groq, together
MCP / agents:  mcp, model-context-protocol, agentic, tool-use
RAG:           rag, retrieval, vector, embedding, pinecone, chroma
ML:            transformers, huggingface, diffusers, torch, tensorflow
Skills:        skill, plugin, extension, chatgpt-plugin, copilot-extension

Full coverage registries
All ClawHub skills and GitHub MCP server repos are scanned regardless of keyword matches. No filtering is applied to these registries.
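The keyword filter reduces to a substring match over a package's metadata. A sketch using the term list above:

```python
AI_KEYWORDS = {
    "langchain", "crewai", "autogen", "llamaindex", "haystack", "dspy",
    "openai", "anthropic", "cohere", "mistral", "groq", "together",
    "mcp", "model-context-protocol", "agentic", "tool-use",
    "rag", "retrieval", "vector", "embedding", "pinecone", "chroma",
    "transformers", "huggingface", "diffusers", "torch", "tensorflow",
    "skill", "plugin", "extension", "chatgpt-plugin", "copilot-extension",
}

def matches_ai_filter(name: str, description: str, keywords: list[str]) -> bool:
    """True if any filter term appears in the name, description, or keywords."""
    haystack = " ".join([name, description, *keywords]).lower()
    return any(term in haystack for term in AI_KEYWORDS)

print(matches_ai_filter("acme-utils", "Retrieval helpers for RAG apps", []))  # True
print(matches_ai_filter("left-pad", "Pads strings", []))                      # False
```

Substring matching is deliberately loose: a false positive only costs one extra scan, while a false negative leaves a package unscanned.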

Expected volume

Registry             Scans/day    Avg time   Compute
PyPI (AI-filtered)   200–400      ~5 sec     ~30 min
npm (AI-filtered)    300–600      ~5 sec     ~50 min
ClawHub              50–100       ~3 sec     ~5 min
GitHub MCP           20–50        ~8 sec     ~7 min
Total                570–1,150               ~90 min

Bot identity

The bot operates under a dedicated sigil-bot account, separate from NOMARK staff activity. Automated outputs are clearly labeled as automated.

  • GitHub: The GitHub App acts as sigil-bot[bot]
  • Scan database: Report pages show “Scanned by Sigil Bot” with timestamp
  • Threat feed: RSS and API entries attributed to the bot identity

Note
Automated scan results are clearly labeled as automated output. Verdicts are statements of algorithmic opinion — see our Methodology and Terms of Service.

Dispute a result

Packages are scanned automatically from public registries without author consent. If you believe a scan result is incorrect, you can file a dispute.

Disputes are acknowledged within 48 hours. See the full dispute process in our Terms of Service.

Need help?

Ask a question in GitHub Discussions or check the troubleshooting guide.