Data Engineering & Databases

The deconstructed database at Datadog

Datadog has grown from a startup focused on infrastructure monitoring into a platform processing over a hundred trillion events daily. Over the years, we expanded beyond metrics to include traces, logs, profiling, real user monitoring, and security. Our user base has broadened from operations teams to developers, analysts, and business users, and more recently automated agents have become key consumers of our platform.
Our bottom-up culture encourages teams to take initiative. As a result, we developed multiple specialized ingestion pipelines and query engines, built to satisfy strict real-time requirements and deliver an interactive experience that gives users the insights they need. This focus on efficiency led to custom-built, proprietary solutions designed for our unique constraints. Today, however, the evolving landscape allows us to reconcile these specialized engines with open standards, blending the versatility of the ecosystem with our purpose-built designs.
In recent years, we have refactored the interfaces of these query engines to create a composable data system. This lets us better leverage shared capabilities, enabling cross-dataset querying, advanced analytics, and more versatile access patterns. Our goal is to scale our bottom-up culture: by defining clear contracts and high-level components, we enable decentralized decision-making, which improves performance, efficiency, and flexibility across the platform while reducing silos.
By adopting a deconstructed stack, we combine the efficiency of the open-source ecosystem with our internal capabilities. This architecture gives us the flexibility to adapt to immediate and future demands for scale, velocity, and operational resilience, and prepares us for growing challenges such as data-intensive AI workloads.
In this talk, we will discuss how we rely on and contribute to key projects in the data ecosystem: Arrow for data interchange, Substrait for plans, Calcite as an optimizer, DataFusion as an execution core, and Parquet for columnar storage.

Speakers