Data Engineering is not just about writing ETL scripts. It's about designing robust, scalable, and reliable systems that ensure high-quality data flow between sources and storage systems, from relational databases to analytical warehouses, from logs to machine learning models. In the modern toolset, Python plays a key role — from transformation and validation to pipeline orchestration and working with cloud storage. That’s why most technical interviews for Data Engineer positions include deep Python knowledge assessment.
Questions may cover both basic topics — working with CSVs, serialization, using Pandas — and more advanced areas: generators, threading, pipeline architecture, interaction with S3, Spark, Dask, and even structuring code for production. To succeed in an interview, knowing syntax isn’t enough — you need to justify architectural decisions, explain approaches to processing big data, and speak the language of efficiency, reliability, and code readability.
Data Engineer Python Interview Questions
Python is considered the primary language in the Data Engineer toolkit due to its flexibility, vast ecosystem, and integration with modern data processing frameworks. Interviews focus on practical aspects: working with large data volumes, stream processing, storage formats, working with Pandas, Dask, S3, and building ETL processes. Questions cover not only syntax but also architecture: how to optimize code, reduce memory usage, and make pipelines fault-tolerant. Knowledge of generators, data types, partitioning, and time series handling is a strong competitive advantage.
1. What is the difference between a list and a generator, and when is it better to use a generator?
A list is a data structure that stores all elements in memory at once. It's convenient when you need index access or multiple iterations. A generator is an iterator that yields values one by one using yield, without loading everything into memory. It is lazy: elements are computed on the fly and discarded after iteration.
When to use a generator:
- when working with large data volumes (files, APIs, databases);
- if data is processed step by step without the need to store it all;
- when memory usage needs to be minimized.
Generators are especially useful in data processing pipelines and ETL workflows, where each step works with a stream instead of a full dataset.
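For example, a minimal sketch of a generator that streams lines from a large log file instead of loading it whole (the file name is illustrative):
def read_lines(path):
    # Yield one line at a time; only the current line is held in memory.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

# Lazy consumption: lines are read only as the loop advances.
error_count = sum(1 for line in read_lines("app.log") if "ERROR" in line)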
2. How to use Pandas to load large files without running out of memory?
By default, Pandas loads the entire file into memory, which is inefficient for large datasets. To avoid OOM errors, use chunked reading via the chunksize parameter in read_csv():
for chunk in pd.read_csv("data.csv", chunksize=100_000):
process(chunk)
Each chunk is a DataFrame you can process separately. You can also:
- use usecols to load only the needed columns;
- specify dtype to avoid unnecessary memory usage;
- use iterator=True and next() for fine-grained control.
This approach allows processing files tens of gigabytes in size without memory overload.
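For illustration, a hedged sketch that combines these options (the column names and dtypes are hypothetical):
import pandas as pd

chunks = pd.read_csv(
    "data.csv",
    usecols=["user_id", "amount", "date"],    # load only the needed columns
    dtype={"user_id": "int32", "amount": "float32"},
    parse_dates=["date"],
    chunksize=100_000,                        # stream the file in pieces
)
total = sum(chunk["amount"].sum() for chunk in chunks)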
3. How does datetime work in Pandas and how to efficiently parse dates?
Pandas provides the Timestamp and DatetimeIndex classes, which extend the standard datetime and allow vectorized time operations.
To parse dates during file loading:
pd.read_csv("data.csv", parse_dates=["date_column"])
Or convert manually:
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
Efficiency tips:
- specifying an exact format speeds up parsing significantly;
- errors="coerce" helps safely handle invalid strings;
- you can extract .dt.year, .dt.month, .dt.day, .dt.hour, and group by time.
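A short illustration of these tips (the DataFrame and column names are hypothetical):
import pandas as pd

# errors="coerce" turns unparseable strings into NaT instead of raising.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
monthly = df.groupby([df["date"].dt.year, df["date"].dt.month])["value"].sum()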
Proper time data handling is critical for aggregations, window functions, and time series analysis.
4. How to read and write data to S3 using Python?
To work with Amazon S3 in Data Engineering, the most commonly used libraries are:
- boto3 — the official AWS SDK;
- s3fs + pandas — for reading CSV and Parquet directly;
- smart_open — for universal cloud storage access.
You can also use s3fs as a file system.
Important: for streaming operations, use BytesIO together with boto3 (especially for writing large files).
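A minimal sketch with boto3, assuming the bucket name and keys are placeholders, AWS credentials are already configured, and pyarrow is installed for Parquet support:
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Read a CSV object into a DataFrame through an in-memory buffer.
obj = s3.get_object(Bucket="my-bucket", Key="raw/data.csv")
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Write the result back as Parquet without touching the local disk.
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
s3.put_object(Bucket="my-bucket", Key="processed/data.parquet", Body=buffer.getvalue())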
5. How does Spark work with Python and what is the difference between DataFrame API and RDD?
PySpark is the Python interface for Apache Spark. Spark has two main APIs:
- RDD — a low-level API based on functions like map, filter, and reduce. It requires a deep understanding of distributed processing.
- DataFrame API — a high-level interface, similar to SQL. It is optimized via Catalyst and Tungsten and is preferred for most tasks.
Advantages of DataFrame:
- readability;
- planner-level optimizations;
- ability to write SQL queries;
- better suited for ETL tasks and aggregations.
For a Data Engineer, it's best to use the DataFrame API unless low-level transformations or control are specifically needed.
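A short DataFrame API sketch (the path and column names are illustrative):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Catalyst optimizes the plan: column pruning and predicate pushdown happen automatically.
orders = spark.read.parquet("s3a://bucket/orders/")
daily = (
    orders
    .filter(F.col("amount") > 0)
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)
daily.write.mode("overwrite").parquet("s3a://bucket/daily_totals/")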
6. How to apply window functions in Pandas?
Window functions in Pandas are implemented via .rolling(), .expanding(), and .groupby().transform():
df["rolling_avg"] = df["value"].rolling(window=3).mean()
Also supported: .shift(), .rank(), .cumsum(), and others.
For complex windows:
df["lag_1"] = df.groupby("user")["value"].shift(1)
df["diff"] = df["value"] - df["lag_1"]
Or aggregate over a window:
df["mean"] = df.groupby("group")["value"].transform(lambda x: x.rolling(3, min_periods=1).mean())
Window functions are essential in time series analysis, user behavior evaluation, and metric computation.
7. How to handle missing values in Pandas?
Pandas provides many tools:
- isnull(), notnull() — for filtering;
- dropna() — to remove rows/columns;
- fillna() — to fill with a constant, a method (ffill, bfill), the mean, or the median.
You can also use interpolate() — linear interpolation between values.
The key is understanding the nature of the missing values: random or systematic. In Data Engineering, missing data should be handled consistently and transparently.
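A hedged illustration of the common options (the column names are hypothetical):
df["price"] = df["price"].fillna(df["price"].median())   # numeric: fill with the median
df["status"] = df["status"].fillna("unknown")            # categorical: explicit marker
df["sensor"] = df["sensor"].interpolate()                # time series: linear interpolation
df = df.dropna(subset=["user_id"])                       # drop rows missing a required key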
8. What is the difference between Parquet and CSV, and when to use each?
- CSV — text format, one line = one record. Simple to use but inefficient in size and speed.
- Parquet — binary columnar format. Supports compression, typing, and predicate pushdown.
Advantages of Parquet:
- faster read performance;
- smaller file size;
- can read only selected columns (pushdown);
- ideal for processing in Spark, Hive, Presto.
Use CSV only for debugging and simple transfers, and Parquet for all production processes and data lake storage.
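For example, a small sketch assuming pyarrow is installed and the file names are placeholders:
import pandas as pd

df = pd.read_csv("events.csv")
df.to_parquet("events.parquet", compression="snappy", index=False)

# Read back only the columns you need — not possible with CSV.
subset = pd.read_parquet("events.parquet", columns=["user_id", "event_time"])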
9. How to implement a data processing pipeline in Python?
A pipeline is a sequence of steps: extract, validate, transform, aggregate, export.
You can use:
- prefect, airflow, luigi — for orchestration;
- pandas / dask / pyarrow — for transformations;
- click / argparse — for CLI pipelines.
Basic implementation:
def extract(): ...
def transform(df): ...
def load(df): ...
if __name__ == "__main__":
df = extract()
df = transform(df)
load(df)
Important: a pipeline should be fault-tolerant, logged, parameterized, and testable. Often — modular, with checkpoint-based restart support.
10. What technology stack do you use in production ETL and why?
In real-world projects, the stack depends on the task but often includes:
- Python — the main language, flexible with a rich ecosystem;
- Pandas/Dask — for transformations;
- Spark/PySpark — for distributed processing;
- Airflow/Prefect — for scheduling;
- S3/MinIO/HDFS — as storage;
- PostgreSQL/ClickHouse — for analytics;
- Docker/K8s — for containerization;
- Great Expectations — for data validation;
- DBT — for SQL-based models.
Rationale: scalability, stability, strong community, and monitoring convenience. The stack must be modular, easily upgradable, and support data quality control.
11. What is the difference between Dask and Pandas, and when should Dask be used?
Dask is a library for parallel and distributed data processing with an API compatible with Pandas. The key difference: Dask works with data that doesn't fit into memory, splitting it into partitions and processing in parallel.
Key differences:
- Pandas — in-memory, single-threaded processing;
- Dask — lazy evaluation, supports multithreading and multiprocessing, scalable across clusters.
When to use Dask:
- when data size exceeds 5–10 GB;
- for parallel loading and pre-aggregation;
- in production ETL where execution time matters.
Dask also integrates well with NumPy and Scikit-Learn, and its Pandas-compatible API means it can often replace Pandas in pipelines with minimal code changes.
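A minimal sketch of the Pandas-like API (the file pattern and column names are illustrative):
import dask.dataframe as dd

# Each CSV file becomes one or more partitions; nothing is loaded yet (lazy evaluation).
df = dd.read_csv("logs/2024-*.csv", dtype={"user_id": "int64"})

# The Pandas-style calls build a task graph; compute() triggers parallel execution.
top_users = df.groupby("user_id")["duration"].sum().nlargest(10).compute()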
12. What data storage formats have you used and how do they differ?
In practice, these are commonly used:
- CSV — text-based, simple, inefficient in size, lacks schema support;
- JSON — text-based, supports nested structures, slow on large volumes;
- Parquet — binary, columnar, compressible, efficient for analytical queries;
- Avro — binary, schema-based, optimized for serialization and streaming;
- ORC — alternative to Parquet, efficient in the Hadoop ecosystem;
- Feather/Arrow — binary formats for fast interprocess communication.
Choice depends on:
- data volume,
- read/write frequency,
- compatibility requirements (Spark, Hive, Presto),
- need for compression or schema support.
In production, use Parquet for lake storage, Avro or JSON for structured transfer, and CSV for prototyping.
13. How do you validate data quality in ETL?
Validation is a critical part of the data pipeline. You should check:
- data types (int, float, datetime);
- value ranges (e.g., no future dates);
- uniqueness/duplicates;
- absence of NULLs;
- business rules (e.g., amount > 0);
- reference list consistency.
Tools:
- assert/try/except — in simple scripts;
- Great Expectations — full DSL system with JSON/YAML config, supports Pandas, Spark, SQL;
- Pandera — type annotations and validation for Pandas DataFrames;
- Cerberus, Pydantic — for JSON and dictionaries.
A good pipeline should not only validate but also log violations, send alerts, and allow flexible configuration of checks.
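For example, a minimal Pandera sketch (the schema and column names are hypothetical):
import pandera as pa

schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, nullable=False),
    "amount": pa.Column(float, checks=pa.Check.gt(0)),                    # business rule: amount > 0
    "country": pa.Column(str, checks=pa.Check.isin(["US", "DE", "UA"])),  # reference list consistency
})

validated = schema.validate(df)   # raises SchemaError with details on violations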
14. How to efficiently aggregate data in Pandas?
To aggregate data, use .groupby():
df.groupby("user_id")["amount"].sum()
Multiple aggregations:
df.groupby("category").agg({
"amount": ["mean", "sum"],
"price": "max"
})
Also available:
- .pivot_table() — for pivot tables;
- .resample() — for time-based aggregation;
- .rolling() — for windowed aggregations;
- pd.crosstab() — for cross-tabulation.
Important tips:
- avoid unnecessary .reset_index() calls;
- ensure proper sorting before aggregation;
- optimize types (use category), especially when grouping by strings.
Pandas is great for aggregating up to 10–20 million rows. Beyond that — use Dask or Spark.
15. How to debug and test a data pipeline?
A data pipeline, like any code, needs to be tested:
- Unit tests — for individual transformation functions (pytest);
- Integration tests — read → process → write with mock data;
- Data quality tests — anomaly detection (e.g., zero records, sudden spikes);
- Property-based testing — invariant checks (e.g., sum > 0, date <= today), for example with the hypothesis library.
Debugging techniques:
- logging (logging, loguru);
- saving intermediate files/chunks;
- visual checks (df.head(), df.info()).
Also use static analysis (flake8, mypy), CI/CD with synthetic data runs, and data validation via Great Expectations.
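A minimal pytest sketch for a hypothetical transformation function:
import pandas as pd

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transformation under test.
    out = df.copy()
    out["total"] = out["price"] * out["quantity"]
    return out

def test_add_total_computes_row_totals():
    df = pd.DataFrame({"price": [2.0, 3.0], "quantity": [5, 4]})
    result = add_total(df)
    assert list(result["total"]) == [10.0, 12.0]
    assert len(result) == len(df)   # no rows gained or lost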
16. What’s the difference between join and merge in Pandas, and what join types exist?
merge() and join() in Pandas are essentially similar, but:
- merge() — a general-purpose method for joining on arbitrary columns;
- join() — a shorter syntax that operates on indices by default.
Supported join types:
- inner — matching records only;
- left/right — all from one table + matching from the other;
- outer — full union of both tables.
Important considerations:
- check for duplicate keys to avoid Cartesian products;
- ensure column type compatibility;
- use indexes when working with join().
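For example (the table and column names are illustrative):
import pandas as pd

orders = pd.DataFrame({"user_id": [1, 2, 2], "amount": [10, 20, 15]})
users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["US", "DE", "UA"]})

# Left join: keep every order, attach user attributes where a match exists.
enriched = orders.merge(users, on="user_id", how="left")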
17. How to track dependencies and execution order in an Airflow DAG?
In Airflow, a DAG (Directed Acyclic Graph) is a set of tasks connected via dependencies:
task1 >> task2 # task1 runs before task2
Dependencies can be defined:
- using the >> and << operators;
- with the set_upstream() / set_downstream() methods;
- using @task decorators in the TaskFlow API.
Key points:
- DAG must be acyclic;
- execution order is not guaranteed without explicit dependencies;
- use TriggerRule (e.g., all_done, one_success) for branching.
Dependencies on files and external resources can also be tracked using ExternalTaskSensor, FileSensor, and TimeDeltaSensor.
A good DAG structure is readable, logical, minimally nested, and includes failure handling.
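A minimal TaskFlow API sketch, assuming a recent Airflow 2.x release (the task bodies are placeholders):
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def etl_pipeline():
    @task
    def extract():
        return {"rows": 100}          # placeholder payload

    @task
    def transform(payload: dict):
        return payload["rows"] * 2

    @task
    def load(result: int):
        print(f"Loaded {result} rows")

    # Passing return values defines the order implicitly: extract >> transform >> load.
    load(transform(extract()))

etl_pipeline()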
18. How to work with time series in Pandas?
Pandas provides powerful tools:
- pd.to_datetime() — convert to datetime format;
- set_index("date") + resample("1D") — date-based aggregation;
- .rolling() — window functions;
- .shift() — lags;
- .diff() — difference between periods;
- .asfreq() — frequency conversion.
Other operations include:
- interpolation of missing values;
- calculation of moving averages;
- aligning data with different frequencies using reindex().
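A short example (the DataFrame and column names are hypothetical):
ts = df.set_index("date").sort_index()
daily = ts["value"].resample("1D").sum()                 # daily aggregation
daily_filled = daily.interpolate()                       # fill gaps between periods
rolling_week = daily_filled.rolling(window=7).mean()     # 7-day moving average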
Time series analysis is common in logs, financial data, and metric tracking.
19. How do window functions work in SQL and how to implement them in Python?
Window functions in SQL: ROW_NUMBER(), RANK(), LAG(), LEAD(), SUM() OVER (PARTITION BY ...).
Pandas equivalents:
- df.groupby(...).cumcount() — ROW_NUMBER();
- df.groupby(...).cumsum() — SUM() OVER;
- shift() — LAG / LEAD;
- rank() — RANK / DENSE_RANK.
You can also use .transform() together with .rolling() for windowed aggregations.
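For example, ROW_NUMBER(), LAG(), and a running SUM() per user (the column names are illustrative):
df = df.sort_values(["user_id", "event_time"])

# ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time)
df["row_number"] = df.groupby("user_id").cumcount() + 1

# LAG(amount) OVER (PARTITION BY user_id ORDER BY event_time)
df["prev_amount"] = df.groupby("user_id")["amount"].shift(1)

# SUM(amount) OVER (PARTITION BY user_id ORDER BY event_time) — running total
df["running_total"] = df.groupby("user_id")["amount"].cumsum()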
Understanding window functions is important for ranking, calculating rolling metrics, and building analytical data marts.
20. How to choose optimal data types in Pandas?
By default, Pandas uses 64-bit types, which can be excessive. Type optimization:
- int64 → int32 / int16;
- float64 → float32;
- object → category when there are few unique values;
- bool — for binary columns;
- datetime64 — for datetime values.
To check memory usage:
df.info(memory_usage="deep")
To convert types:
df["region"] = df["region"].astype("category")
df["views"] = df["views"].astype("int32")
The result — memory savings of 5–10×, especially for large DataFrames. This is critical when working with tens of millions of rows.