Building an Analytics Data Lake on Object Storage

The Relational Database Ceiling

When building a new application, developers naturally default to a relational database. These databases excel at transactional workloads (OLTP): updating a user's password, debiting a balance, or creating a new record.

However, as an application scales, it inevitably begins generating massive volumes of analytical telemetry.

The Analytics Bottleneck: If a high-volume platform like MyFunnelAPI is serving millions of visitors, tracking every page view, click, and conversion event requires writing hundreds of rows per second to the database.

Row-oriented relational databases are poorly suited to handling this kind of analytical workload (OLAP) cost-effectively. As the table grows to billions of rows, even simple COUNT(*) or GROUP BY queries can take minutes to execute, competing with transactional traffic for I/O and CPU and degrading the whole application.

The Shift to the Data Lake

To solve this, modern engineering teams decouple their transactional databases from their analytical systems. They build Data Lakes.

Historically, building a data lake required expensive, specialized data warehousing solutions (like Redshift or Snowflake). Today, the most cost-effective and scalable architectural pattern is to build the data lake directly on top of raw Object Storage.

Instead of writing telemetry events as rows in a database table, the edge network batches the JSON events and writes them as raw files into a globally distributed storage bucket.
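The batching step can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular platform's pipeline: the helper names (object_key, encode_batch), the gzip-compressed newline-delimited JSON layout, and the batch ID are all assumptions made for the example. The key layout mirrors the year=/month= prefix used in the query example later in this article.

```python
import gzip
import json
from datetime import datetime, timezone


def object_key(batch_start: datetime, batch_id: str) -> str:
    """Build a partitioned object key (hypothetical layout) for one batch."""
    return (
        f"events/year={batch_start:%Y}/month={batch_start:%m}/"
        f"{batch_id}.json.gz"
    )


def encode_batch(events: list[dict]) -> bytes:
    """Serialize events as gzip-compressed, newline-delimited JSON."""
    lines = "\n".join(json.dumps(e, separators=(",", ":")) for e in events)
    return gzip.compress(lines.encode("utf-8"))


events = [
    {"tenant_id": "t1", "event_type": "page_view", "ts": "2026-03-01T12:00:00Z"},
    {"tenant_id": "t1", "event_type": "conversion", "ts": "2026-03-01T12:00:05Z"},
]

key = object_key(datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc), "batch-0001")
body = encode_batch(events)

# An uploader (not shown, and specific to your storage provider) would now
# write `body` to the bucket under `key` in a single request.
print(key)
```

Writing one object per batch, rather than one per event, keeps the number of storage requests low and gives the later Parquet conversion step a natural unit of work.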

The Power of Columnar Formats (Parquet)

Writing raw JSON files into a bucket is highly inefficient for querying. If you need to calculate the total number of clicks across a billion JSON events, the query engine must parse every single JSON file, load the entire object into memory, and extract the click_count field.

To optimize the Data Lake, the raw JSON batches are converted into Columnar Formats, primarily Apache Parquet.

In a traditional row-oriented format (like JSON or CSV), data is stored row by row. In a columnar format like Parquet, data is stored column by column.
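The difference between the two layouts is easy to see with plain Python data structures. The sketch below is only a conceptual model (real Parquet files add row groups, encodings, and metadata on top), and the sample events and the to_columns helper are invented for illustration.

```python
# Row-oriented: each record carries every field, as in JSON or CSV.
rows = [
    {"tenant_id": "t1", "event_type": "click", "click_count": 3},
    {"tenant_id": "t2", "event_type": "click", "click_count": 5},
    {"tenant_id": "t1", "event_type": "page_view", "click_count": 0},
]


def to_columns(rows):
    """Pivot a list of records into one list of values per field."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name in columns:
            columns[name].append(row[name])
    return columns


# Column-oriented: all values of one field sit next to each other.
columns = to_columns(rows)

# Summing clicks now touches a single list instead of every record.
total_clicks = sum(columns["click_count"])
print(columns["click_count"], total_clicks)
```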

The Compression Advantage

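Because all the values in a single column (e.g., timestamp) share one type and often resemble their neighbors, Parquet can apply column-specific encodings such as dictionary and run-length encoding before general-purpose compression. A 100GB dataset in JSON is often compressed down to roughly 15GB in Parquet, though the exact ratio depends on the data.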
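To see why same-typed, repetitive columns shrink so well, here is a simplified sketch of two encodings in the spirit of those Parquet uses: dictionary encoding followed by run-length encoding. Parquet's real encodings are more elaborate (for example, it combines run-length encoding with bit-packing), so treat the function names and the output shape here as illustrative assumptions only.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Invert both steps to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


event_type = ["page_view"] * 4 + ["click"] * 3 + ["page_view"] * 2
dictionary, codes = dictionary_encode(event_type)
runs = run_length_encode(codes)

# Nine strings become a two-entry dictionary plus three short runs.
print(dictionary, runs)
```

The same trick does little for row-oriented JSON, where a string, a number, and a timestamp alternate within every record and long runs of identical values rarely occur.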

The Query Advantage

If an analyst runs a query to find the average conversion rate: SELECT AVG(conversion_rate) FROM events WHERE date = '2026-03-01', the query engine does not need to scan the entire dataset.

Because Parquet stores the schema and per-column statistics (min/max values) in a footer at the end of the file, the query engine can skip large chunks of irrelevant data. It reads only the column chunks the query needs (here, conversion_rate and date), and only for the row groups whose statistics show they could contain that specific date.
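The skipping logic can be modeled with a toy version of row groups and their min/max statistics. This is a sketch of the idea rather than of any engine's implementation: the RowGroup class, its fields, and the sample values are assumptions, and dates are kept as ISO strings so that plain string comparison orders them correctly.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_date: str          # statistics an engine would read from the footer
    max_date: str
    dates: list            # the `date` column chunk
    conversion_rate: list  # the `conversion_rate` column chunk


def avg_conversion_rate(row_groups, target_date):
    """Average conversion_rate for one date, skipping groups by min/max stats."""
    total, count, scanned = 0.0, 0, 0
    for group in row_groups:
        # The statistics prove this group cannot contain the date: skip it.
        if target_date < group.min_date or target_date > group.max_date:
            continue
        scanned += 1
        for date, rate in zip(group.dates, group.conversion_rate):
            if date == target_date:
                total += rate
                count += 1
    average = total / count if count else None
    return average, scanned


row_groups = [
    RowGroup("2026-02-27", "2026-02-28",
             ["2026-02-27", "2026-02-28"], [0.10, 0.20]),
    RowGroup("2026-02-28", "2026-03-01",
             ["2026-02-28", "2026-03-01", "2026-03-01"], [0.30, 0.25, 0.75]),
    RowGroup("2026-03-02", "2026-03-03",
             ["2026-03-02", "2026-03-03"], [0.40, 0.50]),
]

average, scanned = avg_conversion_rate(row_groups, "2026-03-01")
print(average, scanned)
```

Only one of the three row groups is actually read. Skipping works best when files are written in roughly date order, so that each row group covers a narrow range.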

The Serverless Query Engine

Once the data is stored as Parquet files in the object storage bucket, how do you actually query it without a database?

The architecture relies on Serverless Query Engines (such as Amazon Athena, Google BigQuery, or edge-native SQL interfaces). In these engines, compute is decoupled from the storage layer, so each can scale independently.

When a query is executed, the engine dynamically spins up ephemeral compute workers, potentially thousands of them for a large scan. These workers connect directly to the Object Storage API, scan the necessary Parquet chunks in parallel, aggregate the results, and spin down.

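-- Example: querying Parquet files directly from Object Storage.
-- The syntax shown is DuckDB-style (read_parquet over a glob); engines such
-- as Athena instead expect an external table defined over the same prefix.
SELECT
  tenant_id,
  COUNT(*) AS total_events
FROM
  read_parquet('s3://my-analytics-bucket/events/year=2026/month=03/*.parquet')
WHERE
  event_type = 'conversion'
GROUP BY
  tenant_id
ORDER BY
  total_events DESC;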
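The scatter-gather pattern behind such a query can be imitated locally. In this minimal sketch, threads stand in for ephemeral workers and in-memory lists stand in for Parquet chunks fetched from the bucket. The function names and sample events are assumptions, and a real engine distributes the work across machines rather than threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk):
    """One worker: filter its chunk and return partial counts per tenant."""
    counts = Counter()
    for event in chunk:
        if event["event_type"] == "conversion":
            counts[event["tenant_id"]] += 1
    return counts


def run_query(chunks, max_workers=4):
    """Scan all chunks in parallel, then merge the partial aggregates."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(scan_chunk, chunks):
            total.update(partial)
    # Equivalent of ORDER BY total_events DESC.
    return total.most_common()


chunks = [
    [{"tenant_id": "t1", "event_type": "conversion"},
     {"tenant_id": "t2", "event_type": "page_view"},
     {"tenant_id": "t1", "event_type": "conversion"}],
    [{"tenant_id": "t2", "event_type": "conversion"},
     {"tenant_id": "t1", "event_type": "conversion"}],
    [{"tenant_id": "t3", "event_type": "page_view"}],
]

result = run_query(chunks)
print(result)
```

Because each worker returns only a small partial aggregate, the final merge stays cheap even when the scanned data is large.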

By using Object Storage as the foundational layer of a Data Lake, engineering teams get storage that scales with almost no practical limit. They pay pennies per gigabyte for storage, and pay for compute only when an analytical query actually runs.
