
MarketStream 🚀

Real-Time Financial Data Lakehouse

A production-grade data engineering platform that ingests real-time cryptocurrency trade data, performs streaming analytics, and stores results for both real-time queries and historical research.

Scala Spark Kafka Delta Lake


🏗️ Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Binance WS    │────▶│     Kafka       │────▶│  Spark Stream   │
│   Trade Feed    │     │  market_trades  │     │   Processor     │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
                                         ┌───────────────┼───────────────┐
                                         ▼               ▼               ▼
                                   ┌──────────┐    ┌──────────┐    ┌──────────┐
                                   │  Delta   │    │ClickHouse│    │  Kafka   │
                                   │  Lake    │    │  (OLAP)  │    │ (OHLCV)  │
                                   └──────────┘    └──────────┘    └──────────┘

Components

Component | Description | Technology
--------- | ----------- | ----------
Ingestor  | WebSocket client connecting to Binance trade streams | Scala, Java-WebSocket
Kafka     | Message bus for raw trade data | Apache Kafka 3.6
Processor | Streaming aggregation (OHLCV) | Spark Structured Streaming 3.5
Storage   | Historical data lake | Delta Lake 3.0 + MinIO (S3)
OLAP      | Real-time analytics queries | ClickHouse
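
The common module's Models.scala holds the shared data types flowing between these components. As a rough illustration only, a trade model might look like the sketch below; the field names follow the public Binance trade payload and are assumptions, not the project's actual schema.

// Hypothetical sketch of a shared trade model; the real Models.scala may differ.
package com.marketstream.common

import java.time.Instant

// One trade event as received from the Binance WebSocket trade stream
final case class Trade(
  symbol: String,        // e.g. "BTCUSDT"
  tradeId: Long,         // exchange-assigned trade id
  price: BigDecimal,     // executed price
  quantity: BigDecimal,  // executed quantity
  eventTime: Instant,    // exchange event timestamp
  isBuyerMaker: Boolean  // true if the buyer was the maker
)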

🛠️ Prerequisites

  • Docker Desktop (with Docker Compose)
  • Java 11-17 (recommended for Spark compatibility)
    • Note: newer Java releases (Java 25+) have compatibility issues with Spark/Hadoop; set JAVA_HOME to a Java 11-17 installation.
  • SBT 1.9+
  • 8GB+ RAM allocated to Docker

🚀 Quick Start

1. Start Infrastructure

# Clone the repository and navigate into it
git clone https://github.com/JonathanApprey/MarketStream.git
cd MarketStream

# Start all services
docker-compose up -d

# Verify services are running
docker-compose ps

Services available: Kafka (localhost:9092), MinIO (http://localhost:9000), and ClickHouse; see docker-compose.yml for the full list of exposed ports.

2. Build the Project

# Compile all modules
sbt compile

# Run tests
sbt test

# Build fat JARs
sbt assembly

3. Run the Ingestor

# Start ingesting trades from Binance
sbt "ingestor/run"

You should see output like:

MarketStream Trade Ingestor Starting
Symbols to ingest: btcusdt, ethusdt, solusdt
WebSocket connected to Binance
Metrics: received=1000, published=998, dropped=2
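
The exactly-once behaviour described under Key Features starts with an idempotent Kafka producer in the ingestor. Below is a minimal sketch of such a producer; the settings and the keying-by-symbol choice are assumptions about KafkaTradeProducer.scala, not a copy of it.

// Hedged sketch of an idempotent Kafka producer (illustrative, not the project's code)
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true") // de-duplicates retried sends
props.put(ProducerConfig.ACKS_CONFIG, "all")                // wait for the full ISR to acknowledge

val producer = new KafkaProducer[String, String](props)
// Keying by symbol keeps each symbol's trades on one partition, preserving their order
producer.send(new ProducerRecord[String, String]("market_trades_raw", "btcusdt",
  """{"symbol":"BTCUSDT","price":"43250.15","quantity":"0.002"}"""))
producer.flush()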

4. Verify Kafka Messages

# In a new terminal
docker exec -it marketstream-kafka kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic market_trades_raw \
  --from-beginning \
  --max-messages 10

5. Run the Stream Processor

# In a new terminal
sbt "processor/run"

6. Query Delta Lake

# Start Spark shell
spark-shell --packages io.delta:delta-spark_2.12:3.0.0

// In the Spark shell (Scala)
spark.read.format("delta").load("s3a://marketstream-raw/trades").show()
spark.read.format("delta").load("s3a://marketstream-aggregated/ohlcv_1m").show()
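
Delta Lake's time travel (listed under Portfolio Highlights) can be explored from the same shell. The snippet below is a hedged example: version 0 and the timestamp are placeholders, and reading from MinIO additionally requires the S3A endpoint and credentials from the Configuration section to be set on the Spark session.

// Read an earlier snapshot of the 1-minute OHLCV table by version (version 0 is a placeholder)
spark.read.format("delta").option("versionAsOf", 0).load("s3a://marketstream-aggregated/ohlcv_1m").show()

// Or by timestamp (placeholder value)
spark.read.format("delta").option("timestampAsOf", "2024-01-01 00:00:00").load("s3a://marketstream-aggregated/ohlcv_1m").show()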

📊 Data Quality

Run the data quality check:

sbt "processor/runMain com.marketstream.quality.QualityCheck"

Checks performed (a sketch of the validation logic follows the list):

  • Price/Volume positive values
  • No null symbols
  • OHLCV consistency (high >= low)
  • VWAP within price range
  • Data freshness
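
A minimal sketch of how a few of these rules could be expressed with the DataFrame API, assuming OHLCV column names like high/low/close/volume/vwap; the actual QualityCheck implementation may be organised differently.

// Hedged sketch: count violations per rule; a healthy table returns zeros everywhere.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

def ohlcvViolations(spark: SparkSession): DataFrame = {
  val ohlcv = spark.read.format("delta").load("s3a://marketstream-aggregated/ohlcv_1m")
  ohlcv.agg(
    count(when(col("close") <= 0 || col("volume") < 0, 1)).as("non_positive_price_or_volume"),
    count(when(col("symbol").isNull, 1)).as("null_symbols"),
    count(when(col("high") < col("low"), 1)).as("high_below_low"),
    count(when(col("vwap") < col("low") || col("vwap") > col("high"), 1)).as("vwap_outside_range")
  )
}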

📁 Project Structure

MarketStream/
├── build.sbt                 # Multi-module build configuration
├── docker-compose.yml        # Infrastructure setup
├── common/                   # Shared models and configuration
│   └── src/main/scala/
│       └── com/marketstream/common/
│           ├── Models.scala
│           └── Configuration.scala
├── ingestor/                 # WebSocket -> Kafka ingestion
│   └── src/
│       ├── main/scala/
│       │   └── com/marketstream/ingestor/
│       │       ├── TradeIngestor.scala
│       │       ├── BinanceWebSocketClient.scala
│       │       └── KafkaTradeProducer.scala
│       └── test/scala/
├── processor/                # Spark streaming + Delta Lake
│   └── src/
│       ├── main/scala/
│       │   └── com/marketstream/processor/
│       │       ├── StreamProcessor.scala
│       │       └── quality/QualityCheck.scala
│       └── test/scala/
└── .github/workflows/ci.yml  # CI/CD pipeline

🔧 Configuration

Environment variables override defaults:

Variable | Description | Default
-------- | ----------- | -------
KAFKA_BOOTSTRAP_SERVERS | Kafka brokers | localhost:9092
S3_ENDPOINT | MinIO/S3 endpoint | http://localhost:9000
S3_ACCESS_KEY | S3 access key | minioadmin
S3_SECRET_KEY | S3 secret key | minioadmin123
SPARK_MASTER | Spark master URL | local[*]
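
As an illustration of how these overrides could be resolved in the common module's Configuration.scala, here is a minimal sketch; the object and helper names are assumptions, and the real code may use a configuration library instead.

// Hypothetical sketch: environment variables override the documented defaults.
object AppConfig {
  private def envOr(name: String, default: String): String =
    sys.env.getOrElse(name, default)

  val kafkaBootstrapServers: String = envOr("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
  val s3Endpoint: String            = envOr("S3_ENDPOINT", "http://localhost:9000")
  val s3AccessKey: String           = envOr("S3_ACCESS_KEY", "minioadmin")
  val s3SecretKey: String           = envOr("S3_SECRET_KEY", "minioadmin123")
  val sparkMaster: String           = envOr("SPARK_MASTER", "local[*]")
}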

🧪 Testing

# Run all tests
sbt test

# Run specific module tests
sbt "ingestor/test"
sbt "processor/test"

# Run with coverage
sbt coverage test coverageReport
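
For illustration, a unit test in one of the test/scala directories might look like the sketch below; ScalaTest and the parsePrice helper are assumptions, not necessarily what the suites actually use.

// Hedged sketch of a unit test (ScalaTest FunSuite assumed as the test framework)
import org.scalatest.funsuite.AnyFunSuite

class PriceParsingSpec extends AnyFunSuite {
  // Hypothetical pure helper under test
  private def parsePrice(raw: String): Option[BigDecimal] =
    scala.util.Try(BigDecimal(raw)).toOption.filter(_ > 0)

  test("positive prices parse successfully") {
    assert(parsePrice("43250.15").contains(BigDecimal("43250.15")))
  }

  test("non-positive or malformed prices are rejected") {
    assert(parsePrice("-1").isEmpty)
    assert(parsePrice("not-a-number").isEmpty)
  }
}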

📈 Key Features

  • High Throughput: Handles 10,000+ trades/second
  • Exactly-Once Semantics: Kafka idempotent producer + Spark checkpointing
  • Late Data Handling: 2-minute watermark for out-of-order data (see the sketch after this list)
  • Data Quality: Automated validation suite
  • Scalable: Designed for horizontal scaling on Kubernetes
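
A minimal sketch of the watermarked, windowed aggregation these features describe is shown below. It assumes a JSON payload with symbol/price/quantity/eventTime fields and writes to the console instead of Delta Lake, so treat it as an illustration of the technique rather than the project's StreamProcessor.

// Hedged sketch: 1-minute OHLCV with a 2-minute watermark (requires the spark-sql-kafka package)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object OhlcvSketch extends App {
  val spark = SparkSession.builder().appName("ohlcv-sketch").master("local[*]").getOrCreate()
  import spark.implicits._

  // Assumed trade schema; the real Models.scala may differ
  val tradeSchema = StructType(Seq(
    StructField("symbol", StringType),
    StructField("price", DoubleType),
    StructField("quantity", DoubleType),
    StructField("eventTime", TimestampType)
  ))

  val trades = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "market_trades_raw")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(from_json($"json", tradeSchema).as("t"))
    .select("t.*")

  // Events older than the 2-minute watermark are dropped; windows finalise once it passes
  val ohlcv = trades
    .withWatermark("eventTime", "2 minutes")
    .groupBy(window($"eventTime", "1 minute"), $"symbol")
    .agg(
      first($"price").as("open"),   // first/last are order-sensitive; good enough for a sketch
      max($"price").as("high"),
      min($"price").as("low"),
      last($"price").as("close"),
      sum($"quantity").as("volume")
    )

  ohlcv.writeStream
    .format("console")                                      // the real pipeline targets Delta Lake / Kafka
    .outputMode("append")                                   // append works with watermarked aggregations
    .option("checkpointLocation", "/tmp/ohlcv-checkpoint")  // checkpointing enables recovery
    .start()
    .awaitTermination()
}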

🎯 Portfolio Highlights

This project demonstrates proficiency in:

Skill | Implementation
----- | --------------
Scala | Functional patterns, case classes, implicits
Spark | Structured Streaming, watermarks, aggregations
Kafka | Producer API, consumer groups, exactly-once
Delta Lake | ACID transactions, time travel, schema evolution
Data Pipelines | End-to-end streaming architecture
CI/CD | GitHub Actions, automated testing
Docker/K8s | Containerized infrastructure

📝 License

MIT License - feel free to use for your own portfolio!


🙏 Acknowledgments

  • Binance for public WebSocket API
  • Apache Spark & Delta Lake communities
  • Confluent for Kafka Docker images
