Coming Soon
Search the content-addressed web

About Semaris

Semaris is building a fast, open search layer for the decentralized web, making public data on IPFS and Filecoin discoverable and useful. Our modular system combines a distributed crawler, content-aware extractor, and a high-performance vector search index, all designed for real-time updates and sub-second search. We solve the problem of content invisibility in content-addressed networks, enabling developers, researchers, and data owners to find, reuse, and trust the data they store. With a focus on performance, open SDKs, and a clear roadmap to decentralization, Semaris aims to be the foundation for search and discovery in the next era of the internet.

Problem Statement: Discovery in Content-Addressed Networks

Content-addressed networks like IPFS are designed for retrieval, not discovery. Once a CID is known, a user can fetch the associated content from any gateway or node that pins it. But finding a CID—without already possessing it—is nontrivial. There is no built-in search layer in the IPFS protocol. Developers resort to site:ipfs.io Google queries, manual scraping of public gateways, or scanning the DHT with scripts that stall at scale. Filecoin adds permanence and incentive layers but doesn't address this visibility gap. The result is that the majority of useful, public data—PDFs, CSVs, governance records, NFTs, and scientific datasets—remains inaccessible unless directly linked or manually indexed.

Operational symptoms of this discovery failure are widely acknowledged. Pinning services and storage providers report repeated uploads of identical files due to users being unaware of pre-existing CIDs. Gateway analytics also show a high volume of 404s and re-fetch attempts. While no open dataset quantifies duplicate uploads across the network, informal estimates from ecosystem maintainers suggest 15–25% of pins in public pinning layers are redundant. Storage costs, egress fees, and user frustration increase accordingly.

In terms of search functionality, no public tool today meets modern expectations. Platforms like ipfs-search.com and CID Gravity offer basic lookups but lack semantic parsing, relevance scoring, or filterable fields. Crawlers like Estuary maintain static indexes with unpredictable update cycles. Our internal benchmarks across five public tools show that most index under 10M unique CIDs, have latencies over 1 second, and return poorly ranked results with no support for field filtering, snippet previews, or similarity search. Meanwhile, the IPFS network continues to grow: independent DHT crawlers suggest hundreds of millions of unique CIDs are publicly routable, and Filecoin reports over 2 EiB of active data as of early 2025.

System Architecture Overview

The architecture is designed to index content-addressed data at scale using modular, stateless components connected through a resilient messaging layer. It consists of three primary services—Crawler, Extractor, and Indexer—linked by Redis Streams queues that enable fault-tolerant processing, horizontal scalability, and real-time ingestion. Each component can be deployed independently, auto-scales on queue depth, and maintains strict separation of concerns. This modular approach supports low-latency dataflow and flexible expansion across geographies and protocols.
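
A minimal sketch of that queue wiring, assuming the redis-py client; the consumer-group and consumer names are illustrative, while the cids:queue and docs:queue stream names match the component descriptions below:

```python
# Sketch: wiring Crawler -> Extractor -> Indexer through Redis Streams.
# Assumes redis-py; group/consumer names are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Each downstream service reads its input stream through a consumer group,
# so work is load-balanced across replicas and acknowledged only on success.
for stream, group in [("cids:queue", "extractors"), ("docs:queue", "indexers")]:
    try:
        r.xgroup_create(stream, group, id="0", mkstream=True)
    except redis.ResponseError:
        pass  # group already exists

# Producer side (e.g. the Crawler) appends discovered CIDs:
r.xadd("cids:queue", {"cid": "bafyexamplecid"})

# Consumer side (e.g. one Extractor replica) claims and acknowledges work:
entries = r.xreadgroup("extractors", "extractor-1", {"cids:queue": ">"}, count=10, block=5000)
for _stream, messages in entries:
    for msg_id, fields in messages:
        # ... process fields["cid"] here, then acknowledge ...
        r.xack("cids:queue", "extractors", msg_id)
```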

Crawler

The Crawler is responsible for CID discovery. It walks the IPFS Kademlia DHT using libp2p to find routable peers and fetches DAG links via gateway /refs endpoints. It emits discovered CIDs to a Redis Stream (cids:queue) for downstream processing. To prevent circular references or redundant ingestion, each CID is checked against a Redis Bloom filter before it is published. A single crawler node running on a 4-core instance can identify 2,500–4,000 unique CIDs per minute, depending on gateway responsiveness and peer availability.
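
A minimal sketch of the dedup-then-publish step, assuming redis-py against a Redis server with the RedisBloom module loaded; the filter name, capacity, and error rate are illustrative:

```python
# Sketch: Bloom-filter dedup before a discovered CID is published to cids:queue.
import redis

r = redis.Redis(decode_responses=True)

# One-time setup: reserve a Bloom filter sized for the expected CID volume
# (error rate and capacity here are illustrative).
try:
    r.execute_command("BF.RESERVE", "crawler:seen", 0.001, 100_000_000)
except redis.ResponseError:
    pass  # filter already exists

def publish_if_new(cid: str) -> bool:
    # BF.ADD returns 1 when the item was (probably) not seen before,
    # so the membership check and the insert are a single round trip.
    is_new = r.execute_command("BF.ADD", "crawler:seen", cid)
    if is_new:
        r.xadd("cids:queue", {"cid": cid})
    return bool(is_new)
```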

Extractor

The Extractor pulls CIDs from the cids:queue and attempts to fetch the associated content. It resolves each CID through a prioritized list of public gateways, falling back to mirrors if necessary. MIME-type sniffing determines the format; common types include JSON, CSV, HTML, PDF, and raw binaries. Content-specific parsers then extract fields, flatten nested structures, and generate semantic snippets, and the enriched document is published to the docs:queue stream for indexing.
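
A sketch of the gateway-fallback fetch and MIME sniffing, assuming the requests library; the gateway list and magic-byte table are illustrative, not exhaustive:

```python
# Sketch: fetch a CID from a prioritized gateway list with fallback,
# then sniff the MIME type from headers or leading bytes.
import requests

GATEWAYS = ["https://ipfs.io", "https://dweb.link"]  # priority order

MAGIC = {b"%PDF": "application/pdf", b"\x89PNG": "image/png", b"{": "application/json"}

def fetch(cid: str, timeout: float = 10.0) -> tuple[bytes, str]:
    for base in GATEWAYS:
        try:
            resp = requests.get(f"{base}/ipfs/{cid}", timeout=timeout)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # fall back to the next gateway / mirror
        body = resp.content
        mime = resp.headers.get("Content-Type", "").split(";")[0]
        if not mime or mime == "application/octet-stream":
            # Header is missing or generic: sniff leading bytes instead.
            mime = next((m for sig, m in MAGIC.items() if body.startswith(sig)),
                        "application/octet-stream")
        return body, mime
    raise RuntimeError(f"all gateways failed for {cid}")
```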

Indexer

The Indexer consumes enriched documents from the docs:queue and performs two parallel writes (sketched in code after this list):

  1. Structured metadata (CID, MIME type, extracted fields, file size, etc.) is written to Postgres for deterministic queries.
  2. Full-text content and vector embeddings are written to OpenSearch. Embeddings are generated via a small transformer model or a pre-computed lookup table, depending on content type.
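
A minimal sketch of that dual write, assuming psycopg2 and opensearch-py; table, index, and field names are illustrative, and the two writes are shown sequentially rather than concurrently for brevity:

```python
# Sketch: dual write of one enriched document to Postgres and OpenSearch.
import psycopg2
import psycopg2.extras
from opensearchpy import OpenSearch

pg = psycopg2.connect("dbname=semaris")
search = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def index_document(doc: dict, embedding: list[float]) -> None:
    # 1. Structured metadata -> Postgres, for deterministic field queries.
    with pg, pg.cursor() as cur:
        cur.execute(
            """
            INSERT INTO documents (cid, mime_type, size_bytes, fields)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (cid) DO NOTHING
            """,
            (doc["cid"], doc["mime"], doc["size"],
             psycopg2.extras.Json(doc["fields"])),
        )
    # 2. Full text + embedding -> OpenSearch, for ranked full-text and
    #    similarity search. In production the two writes run concurrently.
    search.index(
        index="cids",
        id=doc["cid"],
        body={"text": doc["text"], "fields": doc["fields"], "embedding": embedding},
    )
```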

Performance Goals & Engineering Constraints

This system is designed to meet strict performance thresholds across ingestion, indexing, and query—making it viable as both a public good and a commercial-grade API service.

Target Metrics

Metric | Target | Rationale
p95 search latency | < 250 ms | Enables live integrations (dashboards, embeds, autocomplete)
Index freshness | < 1 hour | Keeps up with DAO votes, frontend pushes, dataset updates
Ingestion throughput | ≥ 2,500 CIDs/min/node | Matches realistic DHT/gateway discovery rates per crawler
Query concurrency | ≥ 100 QPS/node | Required for embedded usage by wallets, dashboards, or public portals
Uptime (query + ingest) | 99.99% | Suitable for mission-critical integrations and grant-funded infrastructure
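
For illustration, the p95 latency target can be checked from the client side with a short harness; the /search endpoint, parameter name, and query set below are assumptions rather than a published API:

```python
# Sketch: client-side p95 latency measurement against a hypothetical endpoint.
import time
import requests

def p95_search_latency(base_url: str, queries: list[str]) -> float:
    """Return the 95th-percentile latency in milliseconds for a query set."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        requests.get(f"{base_url}/search", params={"q": q}, timeout=5)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # p95 = the value below which 95% of observed latencies fall.
    return samples[int(0.95 * (len(samples) - 1))]
```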

Roadmap to Decentralization

While the current system is deployed as a centralized SaaS product for speed, control, and reliability, the architecture is deliberately modular to support future decentralization. The long-term objective is to build a search layer that can federate compute, verify index integrity, and distribute query infrastructure across decentralized networks—without sacrificing performance guarantees.

Near-Term (0–12 Months)

Goal: Prove performance, establish demand, and harden infrastructure.

  • Continue centralized deployment using managed Redis, Postgres, and OpenSearch.
  • Release CLI + SDKs (Go, JS, Python) under permissive OSS license.
  • Establish data integrity practices (content hash validation, parse receipts); a validation sketch follows this list.
  • Build usage telemetry and performance observability into the ingestion pipeline.
  • Begin integration testing with Akash and Filecoin for future compute/storage migration.
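
As a sketch of the content hash validation called out above, limited to CIDv0 for brevity (CIDv1 adds a multibase prefix and codec that this does not handle), and assuming the base58 package:

```python
# Sketch: verify that raw block bytes match a CIDv0 (base58, sha2-256) identifier.
import hashlib
import base58  # assumes the 'base58' package

def verify_cidv0(cid: str, block: bytes) -> bool:
    """Check raw block bytes against a CIDv0 identifier.

    `block` must be the raw block as fetched at the block level, not a
    reassembled multi-block file, or the digest will not match.
    """
    multihash = base58.b58decode(cid)
    # CIDv0 multihash layout: 0x12 (sha2-256) || 0x20 (32-byte length) || digest
    if multihash[:2] != b"\x12\x20":
        raise ValueError("not a sha2-256 CIDv0 identifier")
    return hashlib.sha256(block).digest() == multihash[2:]
```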

Mid-Term (12–24 Months)

Goal: Modularize components for decentralized execution and incentivize external contribution.

  • Extractor & Indexer to Akash: Containerized compute jobs migrate to Akash, with proof-of-execution receipts.
  • Index mirroring to Filecoin: Periodic snapshots of indices are committed to Filecoin storage deals.
  • Crawler federation: Allow external operators to contribute CID discovery to a shared stream.
  • Public Gateway Monitoring Layer: Track latency and reliability across gateway endpoints.

🔍 Frequently Asked Questions

What exactly does this project do?

We provide fast, semantic search across public data stored on IPFS—whether it's JSON metadata, DAO records, CSV datasets, or PDF documents. If it has a CID, we make it searchable.

Why does IPFS need a search engine?

Because IPFS is a retrieval protocol—not a discovery layer. You need to already know the CID to fetch content. We index public CIDs, extract content, and make them discoverable via field queries, full-text search, and vector similarity.
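
For the vector-similarity leg, a sketch of what a query could look like, assuming opensearch-py and an index with the OpenSearch k-NN plugin enabled; the index name, field name, and vector dimension are illustrative:

```python
# Sketch: k-nearest-neighbor similarity query against the embeddings index.
from opensearchpy import OpenSearch

search = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# In practice the query vector comes from the same embedding model used
# at index time; a zero vector stands in here as a placeholder.
query_vector = [0.0] * 384

results = search.search(
    index="cids",
    body={
        "size": 10,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
    },
)
for hit in results["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```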

Is this a blockchain? A token?

No. We're an indexing and search layer built on top of existing decentralized storage (IPFS/Filecoin). There is no native token. We monetize via API subscriptions and minimal UI ads.

How is this different from ipfs-search.com or Estuary?

They index small slices of the network and lack semantic relevance, field filters, or usable APIs. We aim for full-network visibility, real-time updates, and sub-second ranked search results.

Can I search private or encrypted data?

No. We only index publicly accessible CIDs from gateways, the DHT, and published pinning services. We do not attempt to access or infer private datasets.

Is this open-source?

Yes. Our crawler, extractor, and SDKs are being released under permissive OSS licenses. Our roadmap includes federation, proof-of-discovery, and decentralized index mirroring.

Will this ever be decentralized?

Yes. Today's system is centralized to ensure speed and stability. Over time, we're migrating compute (to Akash), storage (to Filecoin), and CID discovery (via federated crawlers with provable inclusion).

How do I integrate this into my app or service?

You can use our REST API, CLI tool, or language SDKs (Go, JavaScript, Python). Search CIDs, filter by field, or run semantic queries—all in under 250ms.
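
For illustration only, a REST call could look like the sketch below; the host, path, parameters, and response shape are assumptions, not the published API:

```python
# Hypothetical REST example; endpoint, parameters, and fields are illustrative.
import requests

resp = requests.get(
    "https://api.semaris.example/v1/search",
    params={"q": "climate csv dataset", "mime": "text/csv", "limit": 10},
    headers={"Authorization": "Bearer <API_KEY>"},
    timeout=5,
)
resp.raise_for_status()
for hit in resp.json().get("hits", []):
    print(hit["cid"], hit.get("snippet", "")[:80])
```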

How do you make money?

Through paid API plans (starting at $1K/month for 1M queries) and a single non-invasive ad slot in the public UI. This lets us stay grant-independent and fund OSS contributions.

How do I get involved?

If you manage decentralized content, run a pinning service, or maintain public datasets—contact us to integrate, co-index, or sponsor coverage. If you fund infra, we're open to aligned grants and partnerships.