
MarketStream 🚀

Real-Time Financial Data Lakehouse

A production-grade data engineering platform that ingests real-time cryptocurrency trade data, performs streaming analytics, and stores results for both real-time queries and historical research.

Scala Spark Kafka Delta Lake


🏗️ Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Binance WS    │────▶│     Kafka       │────▶│  Spark Stream   │
│   Trade Feed    │     │  market_trades  │     │   Processor     │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                         ┌───────────────┼───────────────┐
                                         ▼               ▼               ▼
                                   ┌──────────┐    ┌──────────┐    ┌──────────┐
                                   │  Delta   │    │ClickHouse│    │  Kafka   │
                                   │  Lake    │    │  (OLAP)  │    │ (OHLCV)  │
                                   └──────────┘    └──────────┘    └──────────┘

Components

Component | Description | Technology
--------- | ----------- | ----------
Ingestor  | WebSocket client connecting to Binance trade streams | Scala, Java-WebSocket
Kafka     | Message bus for raw trade data | Apache Kafka 3.6
Processor | Streaming aggregation (OHLCV) | Spark Structured Streaming 3.5
Storage   | Historical data lake | Delta Lake 3.0 + MinIO (S3)
OLAP      | Real-time analytics queries | ClickHouse
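
The common module's Models.scala holds the shared data types flowing between these components. As a rough illustration only, a trade model might look like the sketch below; the field names follow the public Binance trade payload and are assumptions, not the project's actual schema.

// Hypothetical sketch of a shared trade model; the real Models.scala may differ.
package com.marketstream.common

import java.time.Instant

// One trade event as received from the Binance WebSocket trade stream
final case class Trade(
  symbol: String,        // e.g. "BTCUSDT"
  tradeId: Long,         // exchange-assigned trade id
  price: BigDecimal,     // executed price
  quantity: BigDecimal,  // executed quantity
  eventTime: Instant,    // exchange event timestamp
  isBuyerMaker: Boolean  // true if the buyer was the maker
)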

🛠️ Prerequisites

  • Docker Desktop (with Docker Compose)
  • Java 11-17 (recommended for Spark compatibility)
    • Note: newer Java releases (Java 25+) have compatibility issues with Spark/Hadoop; set JAVA_HOME to a Java 11-17 installation.
  • SBT 1.9+
  • 8GB+ RAM allocated to Docker

🚀 Quick Start

1. Start Infrastructure

# Clone the repository and navigate into it
git clone https://github.com/JonathanApprey/MarketStream.git
cd MarketStream

# Start all services
docker-compose up -d

# Verify services are running
docker-compose ps

Services available: Kafka (localhost:9092), MinIO (http://localhost:9000), and ClickHouse; see docker-compose.yml for the full list of exposed ports.

2. Build the Project

# Compile all modules
sbt compile

# Run tests
sbt test

# Build fat JARs
sbt assembly

3. Run the Ingestor

# Start ingesting trades from Binance
sbt "ingestor/run"

You should see output like:

MarketStream Trade Ingestor Starting
Symbols to ingest: btcusdt, ethusdt, solusdt
WebSocket connected to Binance
Metrics: received=1000, published=998, dropped=2
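
The exactly-once behaviour described under Key Features starts with an idempotent Kafka producer in the ingestor. Below is a minimal sketch of such a producer; the settings and the keying-by-symbol choice are assumptions about KafkaTradeProducer.scala, not a copy of it.

// Hedged sketch of an idempotent Kafka producer (illustrative, not the project's code)
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true") // de-duplicates retried sends
props.put(ProducerConfig.ACKS_CONFIG, "all")                // wait for the full ISR to acknowledge

val producer = new KafkaProducer[String, String](props)
// Keying by symbol keeps each symbol's trades on one partition, preserving their order
producer.send(new ProducerRecord[String, String]("market_trades_raw", "btcusdt",
  """{"symbol":"BTCUSDT","price":"43250.15","quantity":"0.002"}"""))
producer.flush()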

4. Verify Kafka Messages

# In a new terminal
docker exec -it marketstream-kafka kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic market_trades_raw \
  --from-beginning \
  --max-messages 10

5. Run the Stream Processor

# In a new terminal
sbt "processor/run"

6. Query Delta Lake

# Start Spark shell
spark-shell --packages io.delta:delta-spark_2.12:3.0.0

// In the Spark shell (Scala)
spark.read.format("delta").load("s3a://marketstream-raw/trades").show()
spark.read.format("delta").load("s3a://marketstream-aggregated/ohlcv_1m").show()
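
Delta Lake's time travel (listed under Portfolio Highlights) can be explored from the same shell. The snippet below is a hedged example: version 0 and the timestamp are placeholders, and reading from MinIO additionally requires the S3A endpoint and credentials from the Configuration section to be set on the Spark session.

// Read an earlier snapshot of the 1-minute OHLCV table by version (version 0 is a placeholder)
spark.read.format("delta").option("versionAsOf", 0).load("s3a://marketstream-aggregated/ohlcv_1m").show()

// Or by timestamp (placeholder value)
spark.read.format("delta").option("timestampAsOf", "2024-01-01 00:00:00").load("s3a://marketstream-aggregated/ohlcv_1m").show()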

📊 Data Quality

Run the data quality check:

sbt "processor/runMain com.marketstream.quality.QualityCheck"

Checks performed (a sketch of the validation logic follows the list):

  • Price/Volume positive values
  • No null symbols
  • OHLCV consistency (high >= low)
  • VWAP within price range
  • Data freshness
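
A minimal sketch of how a few of these rules could be expressed with the DataFrame API, assuming OHLCV column names like high/low/close/volume/vwap; the actual QualityCheck implementation may be organised differently.

// Hedged sketch: count violations per rule; a healthy table returns zeros everywhere.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

def ohlcvViolations(spark: SparkSession): DataFrame = {
  val ohlcv = spark.read.format("delta").load("s3a://marketstream-aggregated/ohlcv_1m")
  ohlcv.agg(
    count(when(col("close") <= 0 || col("volume") < 0, 1)).as("non_positive_price_or_volume"),
    count(when(col("symbol").isNull, 1)).as("null_symbols"),
    count(when(col("high") < col("low"), 1)).as("high_below_low"),
    count(when(col("vwap") < col("low") || col("vwap") > col("high"), 1)).as("vwap_outside_range")
  )
}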

📁 Project Structure

MarketStream/
├── build.sbt                 # Multi-module build configuration
├── docker-compose.yml        # Infrastructure setup
├── common/                   # Shared models and configuration
│   └── src/main/scala/
│       └── com/marketstream/common/
│           ├── Models.scala
│           └── Configuration.scala
├── ingestor/                 # WebSocket -> Kafka ingestion
│   └── src/
│       ├── main/scala/
│       │   └── com/marketstream/ingestor/
│       │       ├── TradeIngestor.scala
│       │       ├── BinanceWebSocketClient.scala
│       │       └── KafkaTradeProducer.scala
│       └── test/scala/
├── processor/                # Spark streaming + Delta Lake
│   └── src/
│       ├── main/scala/
│       │   └── com/marketstream/processor/
│       │       ├── StreamProcessor.scala
│       │       └── quality/QualityCheck.scala
│       └── test/scala/
└── .github/workflows/ci.yml  # CI/CD pipeline

🔧 Configuration

Environment variables override defaults:

Variable | Description | Default
-------- | ----------- | -------
KAFKA_BOOTSTRAP_SERVERS | Kafka brokers | localhost:9092
S3_ENDPOINT | MinIO/S3 endpoint | http://localhost:9000
S3_ACCESS_KEY | S3 access key | minioadmin
S3_SECRET_KEY | S3 secret key | minioadmin123
SPARK_MASTER | Spark master URL | local[*]
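
As an illustration of how these overrides could be resolved in the common module's Configuration.scala, here is a minimal sketch; the object and helper names are assumptions, and the real code may use a configuration library instead.

// Hypothetical sketch: environment variables override the documented defaults.
object AppConfig {
  private def envOr(name: String, default: String): String =
    sys.env.getOrElse(name, default)

  val kafkaBootstrapServers: String = envOr("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
  val s3Endpoint: String            = envOr("S3_ENDPOINT", "http://localhost:9000")
  val s3AccessKey: String           = envOr("S3_ACCESS_KEY", "minioadmin")
  val s3SecretKey: String           = envOr("S3_SECRET_KEY", "minioadmin123")
  val sparkMaster: String           = envOr("SPARK_MASTER", "local[*]")
}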

🧪 Testing

# Run all tests
sbt test

# Run specific module tests
sbt "ingestor/test"
sbt "processor/test"

# Run with coverage
sbt coverage test coverageReport
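
For illustration, a unit test in one of the test/scala directories might look like the sketch below; ScalaTest and the parsePrice helper are assumptions, not necessarily what the suites actually use.

// Hedged sketch of a unit test (ScalaTest FunSuite assumed as the test framework)
import org.scalatest.funsuite.AnyFunSuite

class PriceParsingSpec extends AnyFunSuite {
  // Hypothetical pure helper under test
  private def parsePrice(raw: String): Option[BigDecimal] =
    scala.util.Try(BigDecimal(raw)).toOption.filter(_ > 0)

  test("positive prices parse successfully") {
    assert(parsePrice("43250.15").contains(BigDecimal("43250.15")))
  }

  test("non-positive or malformed prices are rejected") {
    assert(parsePrice("-1").isEmpty)
    assert(parsePrice("not-a-number").isEmpty)
  }
}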

📈 Key Features

  • High Throughput: Handles 10,000+ trades/second
  • Exactly-Once Semantics: Kafka idempotent producer + Spark checkpointing
  • Late Data Handling: 2-minute watermark for out-of-order data (see the sketch after this list)
  • Data Quality: Automated validation suite
  • Scalable: Designed for horizontal scaling on Kubernetes
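
A minimal sketch of the watermarked, windowed aggregation these features describe is shown below. It assumes a JSON payload with symbol/price/quantity/eventTime fields and writes to the console instead of Delta Lake, so treat it as an illustration of the technique rather than the project's StreamProcessor.

// Hedged sketch: 1-minute OHLCV with a 2-minute watermark (requires the spark-sql-kafka package)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object OhlcvSketch extends App {
  val spark = SparkSession.builder().appName("ohlcv-sketch").master("local[*]").getOrCreate()
  import spark.implicits._

  // Assumed trade schema; the real Models.scala may differ
  val tradeSchema = StructType(Seq(
    StructField("symbol", StringType),
    StructField("price", DoubleType),
    StructField("quantity", DoubleType),
    StructField("eventTime", TimestampType)
  ))

  val trades = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "market_trades_raw")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(from_json($"json", tradeSchema).as("t"))
    .select("t.*")

  // Events older than the 2-minute watermark are dropped; windows finalise once it passes
  val ohlcv = trades
    .withWatermark("eventTime", "2 minutes")
    .groupBy(window($"eventTime", "1 minute"), $"symbol")
    .agg(
      first($"price").as("open"),   // first/last are order-sensitive; good enough for a sketch
      max($"price").as("high"),
      min($"price").as("low"),
      last($"price").as("close"),
      sum($"quantity").as("volume")
    )

  ohlcv.writeStream
    .format("console")                                      // the real pipeline targets Delta Lake / Kafka
    .outputMode("append")                                   // append works with watermarked aggregations
    .option("checkpointLocation", "/tmp/ohlcv-checkpoint")  // checkpointing enables recovery
    .start()
    .awaitTermination()
}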

🎯 Portfolio Highlights

This project demonstrates proficiency in:

Skill | Implementation
----- | --------------
Scala | Functional patterns, case classes, implicits
Spark | Structured Streaming, watermarks, aggregations
Kafka | Producer API, consumer groups, exactly-once
Delta Lake | ACID transactions, time travel, schema evolution
Data Pipelines | End-to-end streaming architecture
CI/CD | GitHub Actions, automated testing
Docker/K8s | Containerized infrastructure

📝 License

MIT License - feel free to use for your own portfolio!


🙏 Acknowledgments

  • Binance for public WebSocket API
  • Apache Spark & Delta Lake communities
  • Confluent for Kafka Docker images
