What Scale of Data Have You Worked With in Your Career?
This question comes up in almost every data engineering interview because scale changes everything — the tools you use, the architectural decisions you make, and the problems you run into. Interviewers aren’t just asking for a number. They want to know whether you’ve actually worked at a scale that required real engineering decisions, or whether you’ve mostly been doing analysis-level work with manageable datasets.
Why This Question Gets Asked
As of 2025, the distinction between small, medium, and large-scale data has become even more important. Teams working with AI and ML pipelines are handling model training datasets in the terabyte range. Real-time event-driven architectures are processing hundreds of millions of events per day. Data lakehouses are consolidating structured and unstructured data at petabyte scale. Interviewers want to know where on that spectrum your experience sits.
STAR Answer Example
Situation
My experience spans several different data scales across different roles. The most technically demanding scale I’ve worked at was during a position at a large e-commerce company where I was part of the platform data engineering team responsible for event data from user sessions, transactions, and third-party integrations.
Task
My primary responsibility was to maintain and improve the ingestion and transformation pipelines that fed both our real-time recommendation system and our batch-oriented data warehouse. The data volume was significant — roughly 800 million events per day at peak, with the raw event store sitting at around 12 petabytes across a multi-region AWS S3 data lake.
Action
At that scale, the standard approaches you’d use on smaller datasets simply don’t hold up. A few specific examples of how scale shaped my decisions:
For the batch side, we used Apache Spark running on EMR to process daily event aggregations. Even with Spark, naive implementations were too slow. I spent considerable time tuning partition sizes. Getting the partition count right was crucial: too few partitions meant each task processed too much data and risked spilling to disk, while too many meant excessive scheduling and shuffle overhead. I worked through a bottleneck where our final downstream aggregation job was taking 11 hours. After profiling the Spark DAG, I found that a single skewed join was causing most of the delay. I addressed it by salting the join key, which distributed the load more evenly and brought that job down to under 3 hours.
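If an interviewer asks what salting actually does, it helps to be able to sketch it. The following is a minimal pure-Python illustration of the idea, not the Spark job from the answer: the data, the key names (`hot_merchant`, `m0`, ...) and the salt factor `NUM_SALTS` are all hypothetical. The large, skewed side gets a random salt appended to its join key, the small side is replicated once per salt value, and the join then runs on the composite key.

```python
import random
from collections import Counter

NUM_SALTS = 8  # hypothetical salt factor; in practice tuned to the observed skew


def salt_large_side(rows, num_salts, rng):
    """Append a random salt to each join key on the skewed (large) side."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt so every salted key finds a match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Plain in-memory hash join on the (possibly salted) key."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


rng = random.Random(42)

# Hypothetical skewed data: one hot key owns 90% of the large side.
events = [("hot_merchant", i) for i in range(9000)]
events += [(f"m{i}", i) for i in range(1000)]
merchants = [("hot_merchant", "Hot Merchant")]
merchants += [(f"m{i}", f"Merchant {i}") for i in range(1000)]

# Rows per join key before and after salting (a stand-in for rows per task).
unsalted_counts = Counter(key for key, _ in events)
salted_events = salt_large_side(events, NUM_SALTS, rng)
salted_counts = Counter(key for key, _ in salted_events)

# Join on the salted key, then drop the salt from the output.
joined = hash_join(salted_events, explode_small_side(merchants, NUM_SALTS))
result = [(key[0], lv, rv) for key, lv, rv in joined]

print("largest key group before salting:", max(unsalted_counts.values()))
print("largest key group after salting: ", max(salted_counts.values()))
print("joined rows:", len(result))
```

The tradeoff worth mentioning out loud is that the small side grows by a factor of `NUM_SALTS`, so salting only pays off when that side is small relative to the skewed one.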
For real-time data, we ingested from Kafka at roughly 2 million messages per second during peak traffic. I maintained the Kafka consumer groups that fed our Flink streaming jobs and was responsible for managing consumer lag. When we launched a major promotional event, our incoming event rate tripled over a 4-hour window. I had pre-configured auto-scaling for the Flink job operators, but I also had a runbook for manually adjusting Kafka partition counts and rebalancing consumer groups when lag exceeded our alerting threshold. That experience taught me to think about headroom, not just steady-state performance.
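The headroom point can be made concrete with a little arithmetic. The sketch below assumes hypothetical offsets and rates (none of these numbers come from the system described above) and uses plain Python rather than a Kafka client: lag per partition is the log end offset minus the committed offset, and a backlog only drains if consumers can process faster than producers write.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }


def seconds_to_drain(total_lag, consume_rate, produce_rate):
    """Seconds needed to clear a backlog, or None if consumers cannot keep up."""
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return None  # lag grows without bound at these rates
    return total_lag / surplus


def can_absorb_spike(capacity, steady_rate, spike_multiplier):
    """True if consumer capacity covers the steady rate times the spike multiplier."""
    return capacity >= steady_rate * spike_multiplier


# Hypothetical numbers for illustration only.
lag = consumer_lag({0: 100, 1: 250}, {0: 90, 1: 200})
total = sum(lag.values())

steady_rate = 1000  # messages per second in normal traffic
capacity = 2500     # messages per second the consumers can process

print("per-partition lag:", lag)
print("drain time at steady state (s):", seconds_to_drain(total, capacity, steady_rate))
print("survives a 3x spike:", can_absorb_spike(capacity, steady_rate, 3))
```

With these made-up figures the consumers have 2.5x headroom, which looks comfortable at steady state but is not enough for traffic that triples.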
On the storage side, I worked closely with the data platform team on partitioning strategy for the S3 data lake. We partitioned by event date and event type, and I ran experiments with different file formats — comparing Parquet versus ORC for our specific query patterns in Athena. Parquet with Snappy compression consistently won on both cost and query speed for our use case.
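A partitioning scheme like this is easy to show as a path layout. The sketch below assumes Hive-style `key=value` prefixes, which the answer does not specify; the bucket name, column names, and event types are hypothetical. It builds the prefix for one partition and lists the prefixes a date-and-type filter would need to read, which is the effect partition pruning has on a query that filters on both columns.

```python
from datetime import date, timedelta


def partition_prefix(base, event_date, event_type):
    """Build a Hive-style prefix such as .../event_date=2025-01-15/event_type=purchase/."""
    return (
        f"{base.rstrip('/')}/"
        f"event_date={event_date.isoformat()}/"
        f"event_type={event_type}/"
    )


def prefixes_for_query(base, start, end, event_types):
    """List the partitions a query filtered on date range and event type must scan."""
    days = (end - start).days + 1
    return [
        partition_prefix(base, start + timedelta(days=offset), event_type)
        for offset in range(days)
        for event_type in event_types
    ]


BASE = "s3://example-bucket/events/"  # hypothetical location

print(partition_prefix(BASE, date(2025, 1, 15), "purchase"))
for prefix in prefixes_for_query(BASE, date(2025, 1, 1), date(2025, 1, 3), ["purchase"]):
    print(prefix)
```

Putting the date first keeps each day's data together, which suits time-bounded queries; whether that ordering is right depends on which filter the workload applies most often.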
Result
The Spark optimization work I led reduced our nightly batch window by about 5 hours, which unlocked a 7 AM reporting SLA that the business had been asking for. The Flink auto-scaling and Kafka runbook work meant we got through the promotional event without a single consumer lag breach. My file format analysis saved roughly $18,000 per month in Athena query costs by reducing the data scanned per query by around 40%.
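Be ready to show that a cost figure like this holds together. Athena bills by the amount of data scanned, so a cut in bytes scanned translates directly into a proportional cut in cost. The sketch below assumes a list price of $5 per TB scanned (an assumption; check current pricing for your region) and back-calculates what the figures in the answer imply. The baseline spend and scan volume are derived values, not numbers from the original work.

```python
PRICE_PER_TB_USD = 5.00  # assumed Athena list price per TB scanned


def monthly_cost(tb_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Query cost for a month, given total terabytes scanned."""
    return tb_scanned * price_per_tb


def implied_baseline_cost(monthly_savings, scan_reduction):
    """Baseline monthly spend implied by a saving and a fractional scan reduction."""
    return monthly_savings / scan_reduction


savings = 18_000   # USD per month, from the answer
reduction = 0.40   # fraction of scanned data eliminated, from the answer

baseline_cost = implied_baseline_cost(savings, reduction)
baseline_tb = baseline_cost / PRICE_PER_TB_USD
new_cost = monthly_cost(baseline_tb * (1 - reduction))

print(f"implied baseline spend: ${baseline_cost:,.0f} per month")
print(f"implied baseline scan:  {baseline_tb:,.0f} TB per month")
print(f"spend after reduction:  ${new_cost:,.0f} per month")
```

Under that pricing assumption, the claim implies a baseline of roughly $45,000 per month, or about 9,000 TB scanned. If an interviewer probes, being able to walk through this chain is what makes the number credible.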
Talking About Multiple Scales
Most engineers have worked across several scales rather than just one, and that’s worth acknowledging in your answer. Here’s how to frame different scales concisely if you’ve had varied experience:
Small-scale (MBs to a few GBs): Good for rapid prototyping and exploratory work. Pandas and local SQL are sufficient. The value here is speed of iteration, not engineering sophistication.
Medium-scale (GBs to a few TBs): You start needing distributed computing or a well-tuned relational warehouse. This is where Spark or Redshift starts earning its cost.
Large-scale (TBs to PBs): Architecture choices matter enormously. Partitioning strategy, columnar formats, cost of compute, and query engine selection become first-class concerns.
Streaming at scale: Volume is only part of the challenge. Latency requirements, exactly-once semantics, and consumer group management add a different category of complexity.
What Interviewers Are Listening For
When you answer this question, they’re watching for a few things:
- Do you talk about specific numbers? Vague answers like “large amounts of data” are a red flag. Real experience usually comes with specifics.
- Do you mention the constraints that came with scale? Anyone can say they’ve used Spark. Fewer people can explain why partition tuning matters.
- Can you describe tradeoffs you made? Scale forces you to make architectural decisions. If you’ve made them, you should be able to explain them.
- Did scale change your approach? The most convincing answers show how larger data forced you to think differently.
Be specific, anchor your answer in real numbers where possible, and connect the work to a business outcome. That’s what separates a strong answer from a credential recitation.