My Web Intelligence (MWI)

MWI is a reproducible research toolkit to collect web corpora, qualify/enrich them (NLP/LLM-assisted, auditable), and export interpretable outputs (CSV/JSON/GEXF) for digital methods in social sciences and communication studies.

Start here (flagship)

Use this repository first: https://github.com/MyWebIntelligence/mwi

Do not start with mywebapi unless you explicitly need a scalable backend.


Quickstart (get a first result fast)

Recommended: Docker Compose.

git clone https://github.com/MyWebIntelligence/mwi.git
cd mwi

# Choose one mode
./scripts/docker-compose-setup.sh basic   # minimal local setup
# ./scripts/docker-compose-setup.sh api   # API-oriented mode
# ./scripts/docker-compose-setup.sh llm   # ML/embeddings/LLM mode

# Sanity check (example command)
docker compose exec mwi python mywi.py land list

Full installation details:


What MWI does (workflow)

Collect → Qualify → Analyze → Export

  1. Collect
    Build a corpus from seed URLs and curated sources, keep crawl traces, store pages + metadata.

  2. Qualify
    Extract readable content, enrich with NLP and optional LLM-based relevance gating.
    Auditability is a design goal: raw traces are kept and decisions can be inspected.

  3. Analyze
    Produce socio-semantic structures: documents, expressions/entities, similarity links, networks.

  4. Export
    Generate outputs for analysis and visualization:

  • CSV / JSON
  • GEXF (Gephi)
  • structured datasets / reports
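As an illustration of the Export step, the sketch below writes a small similarity network to GEXF using only the Python standard library. It is not MWI's exporter: the function name `build_gexf` and the input shapes are assumptions made for this example, and only the GEXF 1.2 element names (`gexf`, `graph`, `nodes`, `edges`) come from the format itself.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"  # namespace of the GEXF 1.2 format


def build_gexf(nodes, edges):
    """Return a GEXF 1.2 document (as a string) for a small undirected graph.

    nodes: dict mapping node id -> label (hypothetical input shape)
    edges: iterable of (source_id, target_id, weight) tuples
    """
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "undirected"}
    )

    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes.items():
        ET.SubElement(nodes_el, "node", {"id": str(node_id), "label": label})

    edges_el = ET.SubElement(graph, "edges")
    for index, (source, target, weight) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {
                "id": str(index),
                "source": str(source),
                "target": str(target),
                "weight": str(weight),
            },
        )

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Two documents joined by one similarity link (made-up data).
    print(build_gexf({"a": "Page A", "b": "Page B"}, [("a", "b", 0.5)]))
```

Saving the returned string to a `.gexf` file should give a graph that Gephi can open; real exports would typically also carry node attributes such as domain or relevance.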

Key concept: “Land”

A Land is a research project container (topic) holding:

  • terms, seed URLs, crawls
  • extracted content + metadata
  • enrichment layers
  • exports

Think: one Land = one case study / one dataset / one pipeline run.
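To make the container idea concrete, here is a minimal sketch of a Land as a Python data structure. The class and field names (`Land`, `Page`, `seed_urls`, `enrichments`, and so on) are hypothetical and chosen to mirror the list above; they are not MWI's actual data model, which lives in its database.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One collected page: extracted content plus metadata (hypothetical shape)."""
    url: str
    title: str = ""
    text: str = ""
    metadata: dict = field(default_factory=dict)


@dataclass
class Land:
    """A research project container, mirroring the items listed above."""
    name: str
    terms: list = field(default_factory=list)        # topic vocabulary
    seed_urls: list = field(default_factory=list)    # crawl starting points
    pages: list = field(default_factory=list)        # collected Page objects
    enrichments: dict = field(default_factory=dict)  # layer name -> {url: value}
    exports: list = field(default_factory=list)      # paths of generated files

    def add_seed(self, url):
        """Register a seed URL once, ignoring duplicates."""
        if url not in self.seed_urls:
            self.seed_urls.append(url)


if __name__ == "__main__":
    land = Land(name="example-topic", terms=["term one", "term two"])
    land.add_seed("https://example.org/")
    land.add_seed("https://example.org/")  # duplicate, ignored
    print(land.seed_urls)
```

Keeping everything for one topic under a single object is what makes "one Land = one dataset" workable: the seeds, the pages they led to, and the exports derived from them stay traceable to each other.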


Repository map (what each repo is for)

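Flagship (start here)

  • mwi: main repository. Reproducible research tool for collecting, qualifying and analyzing web corpora. https://github.com/MyWebIntelligence/mwi

Components (use when relevant)

  • mwiR: R analysis bridge for MWI exports.
  • mywebapi: experimental scalable backend (API).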


Architecture (high-level)

        ┌──────────────────────────┐
        │ mwi (flagship, local)     │
        │ CLI + reproducible setup  │
        └─────────────┬────────────┘
                      │
          SQLite DB + corpus files
                      │
        ┌─────────────┴────────────┐
        │                          │
  Exports (CSV/JSON/GEXF)     Optional scale-out
  for R / Gephi / notebooks   (mywebapi: Postgres/API/Celery)
        │                          │
      mwiR as bridge          external clients/pipelines
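The left branch of the diagram (SQLite DB to CSV export) can be sketched with the Python standard library alone. The table name `page` and its columns `url`, `title` and `relevance` are assumptions for this example, not MWI's actual schema, and `export_pages_csv` is a hypothetical helper rather than part of the MWI CLI.

```python
import csv
import io
import sqlite3


def export_pages_csv(conn, min_relevance=0):
    """Return pages at or above a relevance threshold as CSV text.

    Assumes a hypothetical table page(url, title, relevance).
    """
    rows = conn.execute(
        "SELECT url, title, relevance FROM page "
        "WHERE relevance >= ? ORDER BY relevance DESC, url",
        (min_relevance,),
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "title", "relevance"])  # header row
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # In-memory database with made-up rows, standing in for a real corpus DB.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (url TEXT, title TEXT, relevance INTEGER)")
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [("https://example.org/a", "A", 3), ("https://example.org/b", "B", 0)],
    )
    print(export_pages_csv(conn, min_relevance=1))
```

A flat file like this is what R, Gephi import scripts, or notebooks would consume downstream; the scale-out branch replaces the SQLite file with a server database but leaves the export contract the same in principle.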

Academic citation

Recommended practice (until stable releases are published everywhere):

  1. Cite the relevant paper(s) (HAL/publications).
  2. Cite the software using either:
    • a GitHub Release tag (preferred), or
    • a commit hash.

Recommended professionalization steps:

  • Add CITATION.cff to mwi and mwiR
  • Publish GitHub Releases (e.g., v0.1.0)
  • Archive releases to Zenodo (DOI)

Support / Contact

For research collaborations, deployments at scale, or reproducible case studies, open an issue on the flagship repository: https://github.com/MyWebIntelligence/mwi/issues


License

See each repository for licensing details (MIT where specified).

Pinned repositories

  1. My-Web-Intelligence-v2 (Python)
    Main repository (flagship). Reproducible research tool for collecting, qualifying and analyzing web corpora.

  2. mwiR (R)
    Component repository: R analysis bridge for MWI exports. Start here: https://github.com/MyWebIntelligence/mwi

  3. mywebapi (Python)
    Component repository: experimental scalable backend (API). Start here: https://github.com/MyWebIntelligence/mwi
