The Relational Database Ceiling
When building a new application, developers naturally default to a relational database such as PostgreSQL or MySQL. These databases excel at transactional workloads (OLTP): updating a user's password, debiting a balance, or creating a new record.
However, as an application scales, it inevitably begins generating massive volumes of analytical telemetry.
The Analytics Bottleneck: If a high-volume platform like MyFunnelAPI is serving millions of visitors, tracking every page view, click, and conversion event requires writing hundreds of rows per second to the database.
Row-oriented relational databases are poorly suited to this kind of analytical workload (OLAP) and become expensive to scale for it. As the table grows to billions of rows, even simple COUNT() or GROUP BY queries can take minutes, because the engine has to read entire rows to aggregate a single column. Those long scans compete with transactional traffic for I/O and memory, degrading the application as a whole.
The Shift to the Data Lake
To solve this, modern engineering teams decouple their transactional databases from their analytical systems. They build Data Lakes.
Historically, large-scale analytics meant paying for a specialized data warehouse (like Redshift or Snowflake). Today, a common and cost-effective pattern is to build the data lake directly on top of raw Object Storage.
Instead of writing telemetry events as rows in a database table, the edge network batches the JSON events and writes them as raw files into a globally distributed storage bucket.
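The batching step can be sketched roughly as follows. This is a minimal illustration and not any particular platform's implementation: the dict named bucket stands in for a real object store, and the helper names (object_key, encode_batch, flush) and the key layout are assumptions chosen to match the year=/month= prefix style used later in this article.

```python
import gzip
import json
from datetime import datetime


def object_key(prefix: str, ts: datetime, batch_id: int) -> str:
    """Build a partitioned key such as events/year=2026/month=03/batch-000001.json.gz."""
    return f"{prefix}/year={ts:%Y}/month={ts:%m}/batch-{batch_id:06d}.json.gz"


def encode_batch(events: list) -> bytes:
    """Serialize events as gzip-compressed, newline-delimited JSON."""
    lines = "\n".join(json.dumps(e, separators=(",", ":")) for e in events)
    return gzip.compress(lines.encode("utf-8"))


def flush(bucket: dict, prefix: str, events: list, batch_id: int) -> str:
    """Write one batch into the bucket and return the key it was stored under.

    `bucket` is a plain dict standing in for an object store (assumption);
    a real deployment would call its storage client's upload method here.
    """
    ts = datetime.fromisoformat(events[0]["ts"])  # partition by first event's time
    key = object_key(prefix, ts, batch_id)
    bucket[key] = encode_batch(events)
    return key


bucket = {}
events = [
    {"ts": "2026-03-01T12:00:00+00:00", "tenant_id": "t1", "event_type": "click"},
    {"ts": "2026-03-01T12:00:01+00:00", "tenant_id": "t2", "event_type": "conversion"},
]
key = flush(bucket, "events", events, batch_id=1)
print(key)
```

Writing one object per batch, rather than one per event, keeps the number of storage requests low and gives downstream jobs reasonably sized files to convert.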
The Power of Columnar Formats (Parquet)
Raw JSON files are cheap to write but inefficient to query. If you need to calculate the total number of clicks across a billion JSON events, the query engine must read every file and parse every JSON object in full just to extract the click_count field.
To optimize the Data Lake, the raw JSON batches are converted into Columnar Formats, primarily Apache Parquet.
In a traditional row-oriented format (like JSON or CSV), data is stored row by row. In a columnar format like Parquet, data is stored column by column.
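The difference between the two layouts can be shown with a toy example. The records and the to_columns helper below are illustrative assumptions; real Parquet files add row groups, pages, and encodings on top of this basic idea.

```python
# Row-oriented: each record carries every field.
rows = [
    {"tenant_id": "t1", "event_type": "click", "click_count": 3},
    {"tenant_id": "t2", "event_type": "conversion", "click_count": 1},
    {"tenant_id": "t1", "event_type": "click", "click_count": 5},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per field."""
    return {name: [record[name] for record in records] for name in records[0]}


# Column-oriented: each field's values are stored contiguously.
columns = to_columns(rows)

# Row layout: every record is visited to read a single field.
total_from_rows = sum(row["click_count"] for row in rows)

# Column layout: only the click_count values are touched.
total_from_columns = sum(columns["click_count"])

print(columns["click_count"], total_from_rows, total_from_columns)
```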
The Compression Advantage
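Because all the data in a single column (e.g., timestamp) is of the same type and often repetitive, Parquet can apply column-specific encodings such as dictionary and run-length encoding before general-purpose compression. A 100GB dataset in JSON is often compressed down to just 15GB in Parquet, although the exact ratio depends on the data.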
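To make the idea concrete, here is a toy version of dictionary encoding followed by run-length encoding on a single low-cardinality column. It is a simplified sketch of the general techniques, not Parquet's actual on-disk encoding, and the function names are made up for illustration.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer index into a sorted dictionary."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Invert both encodings to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Ten string values in one column, with long runs of the same value.
event_type = ["click"] * 4 + ["page_view"] * 3 + ["conversion"] + ["click"] * 2

dictionary, codes = dictionary_encode(event_type)
runs = run_length_encode(codes)

print(dictionary)  # three distinct strings stored once
print(runs)        # four (code, length) pairs instead of ten strings
```

The same trick works poorly on row-oriented data, where neighbouring values belong to different fields and rarely repeat.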
The Query Advantage
If an analyst runs a query to find the average conversion rate: SELECT AVG(conversion_rate) FROM events WHERE date = '2026-03-01', the query engine does not need to scan the entire dataset.
Because Parquet stores the schema and per-chunk statistics (min/max values) in a footer at the end of the file, the query engine can read that footer first and skip chunks whose date range cannot match. For the chunks that remain, it reads only the columns the query touches (here, date and conversion_rate) rather than every field of every row.
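The skipping logic can be sketched in a few lines. The RowGroup class and avg_conversion_rate function are hypothetical stand-ins for what a query engine does with Parquet's min/max statistics; the data is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of rows plus the min/max statistics kept for its date column."""
    min_date: str
    max_date: str
    columns: dict  # column name -> list of values


def avg_conversion_rate(row_groups, target_date):
    """Average conversion_rate for one date, skipping chunks via min/max stats.

    Returns (average, number_of_row_groups_actually_scanned).
    """
    total, count, scanned = 0.0, 0, 0
    for group in row_groups:
        # ISO-8601 dates sort correctly as plain strings.
        if not (group.min_date <= target_date <= group.max_date):
            continue  # statistics prove no row in this chunk can match
        scanned += 1
        # Only the two columns the query needs are read.
        for date, rate in zip(group.columns["date"], group.columns["conversion_rate"]):
            if date == target_date:
                total += rate
                count += 1
    return (total / count if count else None), scanned


row_groups = [
    RowGroup("2026-02-27", "2026-02-28",
             {"date": ["2026-02-27", "2026-02-28"], "conversion_rate": [0.10, 0.20]}),
    RowGroup("2026-03-01", "2026-03-01",
             {"date": ["2026-03-01", "2026-03-01"], "conversion_rate": [0.25, 0.75]}),
    RowGroup("2026-03-02", "2026-03-03",
             {"date": ["2026-03-02", "2026-03-03"], "conversion_rate": [0.30, 0.40]}),
]

average, scanned = avg_conversion_rate(row_groups, "2026-03-01")
print(average, scanned)
```

Skipping works best when the data is sorted or partitioned by the filtered column, since that keeps each chunk's min/max range narrow.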
The Serverless Query Engine
Once the data is stored as Parquet files in the object storage bucket, how do you actually query it without a database?
The architecture relies on Serverless Query Engines (such as Amazon Athena, Google BigQuery, or edge-native SQL interfaces). These engines are decoupled from the storage layer: compute and storage scale, and are billed, independently.
When a query is executed, the engine fans the work out across many ephemeral compute workers, potentially thousands for a large scan. These workers connect directly to the Object Storage API, scan the necessary Parquet chunks in parallel, return partial results for aggregation, and shut down when the query completes.
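-- Example: querying Parquet files directly from Object Storage.
-- Written in DuckDB-style syntax (read_parquet), assuming the files sit
-- directly under this prefix. Engines such as Athena instead query a
-- table defined over the same location.
SELECT
    tenant_id,
    COUNT(*) AS total_events
FROM
    read_parquet('s3://my-analytics-bucket/events/year=2026/month=03/*.parquet')
WHERE
    event_type = 'conversion'
GROUP BY
    tenant_id
ORDER BY
    total_events DESC;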
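The scatter-gather execution behind a query like this can be approximated on a single machine. In the sketch below, threads stand in for ephemeral workers and in-memory dicts stand in for Parquet chunks; scan_chunk and run_query are illustrative names, not part of any real engine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk):
    """Worker: count conversion events per tenant within one chunk."""
    counts = Counter()
    for tenant_id, event_type in zip(chunk["tenant_id"], chunk["event_type"]):
        if event_type == "conversion":
            counts[tenant_id] += 1
    return counts


def run_query(chunks, max_workers=4):
    """Coordinator: scan chunks in parallel, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(scan_chunk, chunks):
            total.update(partial)
    # Mirrors ORDER BY total_events DESC.
    return total.most_common()


chunks = [
    {"tenant_id": ["t1", "t2", "t1"],
     "event_type": ["conversion", "click", "conversion"]},
    {"tenant_id": ["t2", "t1", "t3"],
     "event_type": ["conversion", "conversion", "page_view"]},
]

result = run_query(chunks)
print(result)
```

Because each worker only needs its own chunk, and partial counts merge by simple addition, adding more workers shortens the scan without changing the result.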
By using Object Storage as the foundational layer of a Data Lake, engineering teams get storage that scales without capacity planning. They pay pennies per gigabyte for storage, and pay for query compute only when a query runs, typically in proportion to the data it scans.