Pick a topic below or use the full list to practice end-to-end.
| Databases and Data Warehouses | |||||
|---|---|---|---|---|---|
| Questions | Description | Useful links | GitHub Repo | Official page | |
| Apache Cassandra | Cassandra is a distributed, wide-column store, NoSQL database management system. | Awesome Cassandra | |||
| Greenplum | Greenplum is a big data technology based on MPP architecture and the Postgres open source database technology. | Awesome Greenplum | |||
| MongoDB | MongoDB is a document-oriented database. | Awesome MongoDB | |||
| Apache HBase | HBase is an open-source non-relational distributed database. | Awesome HBase | |||
| Apache Hive | Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. | Awesome Hive | |||
| Amazon DynamoDB | Amazon DynamoDB is a fully managed proprietary NoSQL database service. | Awesome DynamoDB Awesome AWS | |||
| Amazon Redshift | Amazon Redshift is a data warehouse product. | Amazon Redshift Utilities Awesome AWS | |||
| BigQuery GCP | BigQuery is a fully managed, serverless data warehouse. | Awesome BigQuery | |||
| Bigtable GCP | Bigtable is a fully managed wide-column and key-value NoSQL database service. | Awesome Bigtable | |||
| Data Formats | |||||
| Apache Avro | Avro is a row-oriented remote procedure call and data serialization framework. | Awesome Avro | |||
| Apache Parquet | Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. | Parquet format · Docs | |||
| Delta | Delta Lake is a storage framework that enables building a Lakehouse architecture with compute engines. | Delta examples | |||
| Apache Iceberg | Apache Iceberg is an open table format for huge analytic datasets. | Iceberg docs | |||
| Apache Hudi | Apache Hudi brings upserts, deletes, and incremental processing to data lakes. | Hudi docs | |||
| Big Data Frameworks | |||||
| Apache Airflow | Apache Airflow is a workflow management platform for data engineering pipelines. | Awesome Airflow | |||
| Apache Flume | Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. | Flume User Guide | |||
| Apache Hadoop | Apache Hadoop is a collection of software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. | Awesome Hadoop | |||
| Apache Impala | Apache Impala is a parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop. | Impala docs | |||
| Apache Kafka | Apache Kafka is a distributed event store and stream-processing platform. | Awesome Kafka | |||
| Apache NiFi | Apache NiFi is a software project designed to automate the flow of data between software systems. | Awesome NiFi | |||
| Apache Spark | Apache Spark is a unified analytics engine for large-scale data processing. | Awesome Spark | |||
| Apache Flink | Apache Flink is a unified stream-processing and batch-processing framework. | Awesome Flink | |||
| Kubernetes | Kubernetes is a system for managing containerized applications across multiple hosts. | Awesome Kubernetes | |||
| Cloud providers | |||||
| Amazon Web Services | Amazon Web Services is an online platform that provides scalable and cost-effective cloud computing solutions. | Awesome AWS | |||
| Microsoft Azure | Microsoft Azure is Microsoft's public cloud computing platform. | Awesome Azure | |||
| Google Cloud Platform | Google Cloud Platform is a suite of cloud computing services. | Awesome GCP | |||
| Modern Data Stack | |||||
| dbt | dbt is a transformation framework for building tested and documented SQL models. | dbt tests | |||
| Theory | |||||
| DWH Architectures | A data warehouse architecture defines how data communication, processing, and presentation are organized for end-client computing within the enterprise. | Awesome databases | |||
| Change Data Capture (CDC) | CDC captures inserts/updates/deletes from source systems for low-latency ingestion. | Debezium docs | |||
| Data Modeling | Dimensional modeling concepts used to build reliable analytics datasets. | Kimball Group | |||
| Data Quality | Tests, monitoring, and practices to ensure datasets are trusted and correct. | Great Expectations docs | |||
| Data Observability | Monitoring and incident response practices for pipeline and dataset health. | OpenLineage | |||
| Data Governance | Ownership, policies, privacy, and access controls for data platforms. | DataHub | |||
| Cost Optimization | Practical techniques to reduce compute and storage costs while meeting SLAs. | Spark tuning | |||
| Python for Data Engineering | Python fundamentals for reliable, scalable data pipelines and tooling. | PyArrow docs | |||
| Data System Design | System design interview questions for batch/streaming data platforms. | Data mesh overview | |||
| Data Structures | A data structure is a specialized format for organizing, processing, retrieving and storing data. | Awesome Algorithms | |||
| SQL | SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS). | Awesome SQL | |||
| Data visualization tools/BI | |||||
| Tableau | Tableau is a data visualization tool used in business intelligence. | Tableau Desktop docs | |||
| Looker | Looker is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time. | Looker docs | |||
| Apache Superset | Superset is a modern data exploration and data visualization platform. | Superset docs | |||
Please contribute to this repository to help make it better. Any change, such as a new question, a code improvement, or a doc improvement, is very welcome.
See CONTRIBUTING.md for quick checks and guidelines.