The profession of a Data Analyst requires not only an understanding of business metrics and the ability to work with tables, but also a solid technical foundation, without which deep data analysis is impossible. Among the key skills of an analyst today is confident knowledge of Python. This language allows not only visualizing results or building pivot tables, but also developing full analytical pipelines — from preprocessing to hypothesis validation. That’s why, in Data Analyst interviews, Python is viewed not as an additional skill, but as an essential tool.
Employers expect candidates to know how to clean, transform, and aggregate data using Pandas, visualize key patterns, work with time series, filter and merge data sources, and eliminate errors that could affect final conclusions. It's not just about “knowing the commands,” but about understanding how to build reliable, reproducible analysis that brings real value to the business.
Data Analyst Python Interview Questions
Python is not just a language for analysis — it's the environment where the analyst’s entire workflow is built. Most tasks related to data preparation and analysis, automation of calculations, validation, and visualization are solved with Python. During interviews, candidates are evaluated on how confidently they use Pandas, NumPy, Matplotlib; how they handle dates, string variables, make selections, filter, and aggregate data. It's important to be able to explain why a specific approach is used and what impact it has on the analysis.
1. How do you typically start working with a new dataset in Python?
When I receive a new dataset, I start with an initial review: I load it into a Pandas DataFrame and examine the structure. I begin with df.info() — this gives me an idea of data types and missing values. Then I use df.describe() for a numerical summary. I look at unique values of categorical variables, frequencies, and duplicates. At this stage, I assess which variables might be targets, which need recoding, and where normalization is needed. A critical step is to check for missing data and outliers. Only after that do I start forming hypotheses and cleaning the data. Python with Pandas is perfect for this — it lets me work transparently and efficiently.
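A minimal sketch of this first pass, assuming a hypothetical orders.csv file with a mix of numeric and categorical columns:

```python
import pandas as pd

# Load the file (orders.csv is a hypothetical example)
df = pd.read_csv("orders.csv")

# Structure: column dtypes, non-null counts, memory footprint
df.info()

# Numerical summary and a quick look at the first rows
print(df.describe())
print(df.head())

# Missing values, duplicates, and cardinality of each column
print(df.isna().sum())
print(df.duplicated().sum())
print(df.nunique())
```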
2. How do you handle missing values in Python?
It depends on the context, but I use various approaches: sometimes I simply drop rows (dropna()) if there are few missing values and their absence is not critical. In other cases, I fill with the mean or median (fillna()) for numeric variables: the mean when the distribution is roughly normal, the median when it is skewed. For time series, I often use ffill() or bfill(). For categorical variables — I either fill with the mode or create a separate “Unknown” category. Before making a decision, I always analyze the percentage of missing values and check whether they systematically affect the target variable. All of this is done using Pandas in Python, and I try to log which methods I use to keep the pipeline reproducible.
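A hedged sketch of these options, assuming hypothetical columns user_id, revenue, daily_visits, signup_date, and city:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file

# Share of missing values per column - the basis for the decision
print(df.isna().mean().sort_values(ascending=False))

# Numeric column: median is a safer default for skewed distributions
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# Categorical column: keep the fact of missingness as its own category
df["city"] = df["city"].fillna("Unknown")

# Time series: forward-fill gaps after sorting by date
df = df.sort_values("signup_date")
df["daily_visits"] = df["daily_visits"].ffill()

# Drop rows only when a critical key is missing
df = df.dropna(subset=["user_id"])
```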
3. What is the difference between .loc[] and .iloc[], and when do you use each?
.loc[] is used to access rows and columns by labels, while .iloc[] is used by positions. For example, if I want to retrieve a row by its numeric position, I use .iloc[3], and if by a specific index label, then .loc['user_123']. Same with columns: df.loc[:, 'age'] — by name, df.iloc[:, 2] — by position. In practice, I use .loc[] more often because I usually work with named columns, and it’s more readable. .iloc[] is useful when I'm unsure of column names or working inside a loop. It’s a basic skill, but a common mistake is using .iloc and expecting it to work by column name — so it's important to distinguish the two.
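A small illustration of the difference, using a toy DataFrame with string index labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 31, 40], "city": ["Berlin", "Paris", "Rome"]},
    index=["user_121", "user_122", "user_123"],
)

# Label-based access
print(df.loc["user_123"])   # row by index label
print(df.loc[:, "age"])     # column by name

# Position-based access
print(df.iloc[2])           # third row, regardless of its label
print(df.iloc[:, 1])        # second column ("city")

# df.iloc[:, "age"] would raise an error: .iloc only accepts positions
```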
4. How do you create visualizations in Python and which libraries do you use most often?
For basic tasks, I rely on Seaborn and Matplotlib. If I’m quickly exploring distributions, I use sns.histplot() or boxplot(); for categorical variables — countplot(). With Pandas, this is very fast. If I need customization, I switch to Matplotlib: add legends, colors, format axes. For interactive visualizations or dashboards, I prefer Plotly — it offers great support for zoom, hover, and clicks. In Python, all these libraries integrate easily with Pandas, and visualization is part of EDA, not a separate phase. I build plots not just “to look nice,” but to answer specific analytical questions: where are the outliers, is there a correlation, which segments stand out.
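A minimal EDA plotting sketch, assuming hypothetical age and segment columns:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("orders.csv")  # hypothetical file

# Distribution of a numeric variable
sns.histplot(data=df, x="age", bins=30)
plt.title("Age distribution")
plt.show()

# Frequencies of a categorical variable
sns.countplot(data=df, x="segment")
plt.xlabel("Customer segment")
plt.show()
```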
5. How do you work with time-based data in Python?
Time-based data is one of the most interesting types. I always start by converting the column to datetime using pd.to_datetime(), because it unlocks access to many useful attributes: dt.year, dt.month, dt.dayofweek, etc. Then I set the index if I want to resample: df.set_index('date'), and then I can use .resample('M') or .rolling('7D') and aggregate over a time window. This is especially useful when analyzing seasonality, trends, and user behavior over time. I also frequently calculate lags using .shift(), for example, to measure the time between purchases or visits. Python with Pandas and datetime makes working with dates very straightforward.
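A sketch of these steps, assuming hypothetical date, revenue, and user_id columns (note that recent Pandas versions prefer the "ME" alias over "M" for monthly resampling):

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file

# Parse strings into datetime and pull out useful attributes
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["weekday"] = df["date"].dt.dayofweek

# Set the index to resample and compute rolling windows
ts = df.set_index("date").sort_index()
monthly = ts["revenue"].resample("M").sum()      # "ME" in newer Pandas
rolling_7d = ts["revenue"].rolling("7D").mean()

# Lag-based feature: days since the user's previous purchase
df = df.sort_values(["user_id", "date"])
df["days_since_prev"] = df.groupby("user_id")["date"].diff().dt.days
```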
6. How do you debug and verify your analytical code in Python?
I try to write clean, step-by-step code: break it into logical parts, add intermediate print() or df.head() where needed. I use assert to check dimensions, uniqueness, merge correctness. If the pipeline is complex, I use logging instead of print to make it scalable. When possible, I write tests for functions using pytest, especially for those used in production reports. I also make sure to include try/except blocks, especially when reading from unstable data sources. Python allows for both quick-and-dirty coding and more structured, scalable code — and a good analyst should know when to add robustness.
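A hedged example of this kind of checking, with hypothetical table and column names:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

orders = pd.read_csv("orders.csv")  # hypothetical sources
users = pd.read_csv("users.csv")

merged = orders.merge(users, on="user_id", how="left")

# Sanity checks: a left join must not add or drop rows,
# and the join key must be unique on the right side
assert len(merged) == len(orders), "row count changed after merge"
assert users["user_id"].is_unique, "duplicate keys in users"

logger.info("merged shape: %s", merged.shape)
```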
7. How do you merge datasets in Python, and which types of joins do you use most often?
I usually use pd.merge() because it gives full control: you can choose the keys and specify how (inner, left, outer, right). Most often I use left join, especially when the main dataset is fixed and the others are reference tables. If the key columns have different names, I use left_on and right_on. I always check the dataset size before and after the join to ensure it didn’t result in a Cartesian product. I also use concat() for vertical or horizontal concatenation — especially when data is split by years or regions. In practice, these cases are common, and it’s important to understand not just how to join, but what the expected result should be.
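A minimal sketch with hypothetical orders and products tables whose key columns are named differently:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")      # hypothetical files
products = pd.read_csv("products.csv")

rows_before = len(orders)

# Left join: keep every order, attach product attributes where they exist
merged = pd.merge(
    orders, products,
    how="left",
    left_on="product_id", right_on="id",
)

# Guard against an accidental Cartesian product (duplicated keys on the right)
assert len(merged) == rows_before, "join multiplied rows, check key uniqueness"

# Vertical concatenation when the same data is split by year
sales = pd.concat(
    [pd.read_csv("sales_2023.csv"), pd.read_csv("sales_2024.csv")],
    ignore_index=True,
)
```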
8. How do you handle outliers and what do you consider an outlier?
An outlier is a value that significantly differs from others in the same column. I often use the IQR method: calculate the 1st and 3rd quartiles and look for values beyond 1.5×IQR. I also consider business logic — for example, if a customer’s age is 300, that’s clearly a data error. I visually check using boxplots or histograms. I remove outliers only if they don’t affect the core analysis or if it’s obviously an error. Sometimes I cap values (winsorization) to preserve the sample size. It depends on the task: if the outlier distorts the metric — I remove it; if it’s rare but valid — I keep it.
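A sketch of the IQR rule plus simple capping, assuming a hypothetical revenue column:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file

q1 = df["revenue"].quantile(0.25)
q3 = df["revenue"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag outliers rather than deleting them right away
outliers = df[(df["revenue"] < lower) | (df["revenue"] > upper)]
print(f"{len(outliers)} potential outliers")

# Winsorization-style capping preserves the sample size
df["revenue_capped"] = df["revenue"].clip(lower=lower, upper=upper)
```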
9. How do you work with categorical variables in Python?
I first look at the number of unique values. If there are too many, I may simplify them (e.g., group minor categories into "Other"). If there are few, I encode them using pd.get_dummies() or map() to numerical values. For modeling — I use One-Hot or Ordinal Encoding, and sometimes Target Encoding depending on the task. For memory efficiency, I convert columns to category type, especially for large datasets. It’s also important to check for mismatches during joins, when the same category might be spelled differently. Cleaning categorical variables is often manual work, but Python helps automate much of it.
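A sketch of grouping rare categories and encoding them, assuming a hypothetical city column:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file

# Group categories that cover less than 1% of rows into "Other"
shares = df["city"].value_counts(normalize=True)
rare = shares[shares < 0.01].index
df["city"] = df["city"].where(~df["city"].isin(rare), "Other")

# Memory-friendly dtype for repeated strings
df["city"] = df["city"].astype("category")

# One-hot encoding for modeling
encoded = pd.get_dummies(df, columns=["city"], prefix="city")
```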
10. How do you interpret correlation in data, and what methods do you use?
I usually start with df.corr() — Pearson’s correlation for numerical variables. Then I visualize it with a heatmap using Seaborn. For categorical variables, I use Cramér’s V or check frequency tables. Correlation is not causation — it only reflects linear dependence. So I always analyze whether there might be hidden factors influencing both variables. Sometimes I calculate partial correlations to exclude the effect of a third variable. The main thing is not just to get a number, but to understand how it affects conclusions: should variables be used together, are they redundant in a model or report.
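A minimal correlation check, restricted to numeric columns and visualized with a Seaborn heatmap:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("orders.csv")  # hypothetical file

# Pearson correlation over numeric columns only
corr = df.select_dtypes("number").corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation")
plt.show()
```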
11. How do you choose data types for columns in Pandas, and why does it matter?
When I load data into Pandas, I first check data types using df.dtypes. By default, Pandas assigns object, int64, float64, but that’s not always optimal. For example, if a categorical variable is a string with repeating values, I convert it to category to reduce memory usage. It’s also important for numeric values: if they’re small, int32 or float32 is usually enough. Optimizing types is especially relevant with large datasets — it improves performance and read/write speed. So data types are not just a “formality” — they’re a way to make data work faster and more reliably.
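A hedged example of type optimization, with hypothetical column names:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file

print(df.dtypes)
before = df.memory_usage(deep=True).sum()

# Repeated strings -> category; small numbers -> downcast
df["segment"] = df["segment"].astype("category")
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")

after = df.memory_usage(deep=True).sum()
print(f"memory: {before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```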
12. How do you handle duplicates in a dataset, and what do you consider a duplicate?
Duplicates can be full or partial. I always start with df.duplicated() — first by the entire row, then by key columns. If a row is fully duplicated, it’s usually a loading or merge error, and I delete it. If only the key is duplicated and other fields differ, I investigate — it could be a version of the record or a bug. In such cases, I prefer to keep the latest version or aggregate, depending on the task. It’s important not to delete duplicates “blindly” — I always check how they affect metrics, especially when calculating sums or averages.
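A sketch of both checks, assuming a hypothetical order_id key and an updated_at timestamp:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file

# Full duplicates are almost always a loading or merge error
print(df.duplicated().sum())
df = df.drop_duplicates()

# Partial duplicates: same key, different fields - inspect before deleting
dupe_keys = df[df.duplicated(subset=["order_id"], keep=False)]
print(dupe_keys.sort_values("order_id").head())

# If records are versions, keep the most recent one per key
df = (df.sort_values("updated_at")
        .drop_duplicates(subset=["order_id"], keep="last"))
```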
13. How do you write functions in Python when processing data?
When I notice that I’m performing the same operations — like cleaning multiple columns or applying a standard transformation — I immediately move that logic into a function. This reduces repetition and makes the code more readable. I try to write functions that are reusable, with parameters, so I can use them in pipelines or on different datasets. I always add a docstring — a short description of what the function does and its arguments. Sometimes I include type hints, especially if the function is going into a shared repository. Python is very flexible in this: I can quickly write a simple function, and later, if needed, wrap it in a class or decorator.
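An illustrative helper in that spirit; the function name and column choices are hypothetical:

```python
import pandas as pd


def clean_text_column(df: pd.DataFrame, column: str, fill_value: str = "unknown") -> pd.DataFrame:
    """Lower-case, strip, and fill missing values in a text column.

    Returns a new DataFrame so the original is left untouched.
    """
    out = df.copy()
    out[column] = (
        out[column]
        .astype("string")
        .str.strip()
        .str.lower()
        .fillna(fill_value)
    )
    return out


# Reusable across datasets and columns
df = pd.DataFrame({"city": [" NYC", None, "Paris "]})
df = clean_text_column(df, "city")
```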
14. How do you handle string (text) data in Pandas?
Text data is usually of type object, and it’s handled using .str. I use .str.lower(), .str.strip(), .str.contains(), and .str.replace() for cleaning. It’s especially important to remove spaces, separators, and fix casing. I often check for patterns using regular expressions (.str.extract() or .str.match()). If the text variable is categorical, I convert it to category type. I also try to standardize formatting — for example, if values come from different sources and appear as “NYC”, “nyc”, “New York”. Texts, especially in forms or logs, almost always require normalization before analysis.
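A sketch of typical string normalization, with a hypothetical city column and a mapping for the “NYC”-style variants:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file

# Basic cleaning: casing, surrounding spaces, repeated whitespace
df["city"] = (
    df["city"]
    .str.strip()
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
)

# Collapse known spelling variants into one canonical value
df["city"] = df["city"].replace({"nyc": "new york", "new york city": "new york"})

# Pattern extraction, e.g. pulling a numeric id out of a log-like field
df["ticket_id"] = df["ticket_ref"].str.extract(r"(\d+)", expand=False)
```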
15. How do you verify that data remains correct after transformation?
After every major transformation — filtering, aggregation, merging — I perform validation steps. For example, I compare the number of rows before and after, check whether all key values are still present, and review basic statistics. I often use assert to make sure, for instance, that totals remain within expected limits. Sometimes I save intermediate results and compare them with the original. The key is — don’t blindly trust the code. Any operation can break the data, especially joins or groupby. That’s why I always take time to validate — and it always pays off.
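A hedged example of such checks around a merge, with hypothetical table and column names:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")  # hypothetical files
users = pd.read_csv("users.csv")

total_before = orders["revenue"].sum()

merged = orders.merge(users, on="user_id", how="left")

# Row count and totals must be unchanged by a left join
assert len(merged) == len(orders), "row count changed"
assert abs(merged["revenue"].sum() - total_before) < 1e-6, "revenue total drifted"

# Every order should still have its user key
assert merged["user_id"].notna().all(), "lost user_id values"
```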
16. How do you filter data in Pandas and what do you consider?
I use boolean masks: df[df['age'] > 30]. For complex conditions, I use &, |, and wrap each condition in parentheses. I avoid query() when variables are substituted dynamically — it’s easy to make a mistake there. I also filter step by step: apply one mask, then another — it’s easier to debug that way. It’s very important to understand how Pandas treats NaN in filtering: comparisons with NaN evaluate to False, so those rows are silently dropped by the mask. When working with dates, I filter using df['date'] >= pd.Timestamp(...). I always check the dataset size before and after filtering, especially before aggregation or training a model.
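A compact filtering sketch illustrating these points, with hypothetical column names:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file
df["date"] = pd.to_datetime(df["date"])

rows_before = len(df)

# Combined conditions: each wrapped in parentheses; NaN rows are excluded
mask = (df["age"] > 30) & (df["country"] == "DE") & df["revenue"].notna()
subset = df[mask]

# Date filtering
recent = df[df["date"] >= pd.Timestamp("2024-01-01")]

print(f"{rows_before} -> {len(subset)} rows after filtering")
```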
17. How do you perform data aggregation in Pandas?
I use groupby() — it's the main tool. Most often, I use groupby().agg() with multiple aggregations on different columns: sum, mean, count. When needed, I customize column names via dictionaries for clarity. Sometimes I use groupby().transform() when I want to keep the original DataFrame structure, for example, to add the group average. For time-based aggregation, I use resample(). I also often use .pivot_table() for pivot tables. The main thing is to understand how the resulting index works — and if needed, reset it using reset_index().
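A minimal sketch of these patterns, using hypothetical segment, month, order_id, and revenue columns:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file

# Named aggregations give readable column names right away
summary = (
    df.groupby("segment")
      .agg(total_revenue=("revenue", "sum"),
           avg_revenue=("revenue", "mean"),
           orders=("order_id", "count"))
      .reset_index()
)

# transform keeps the original shape, e.g. to compare each row to its group mean
df["segment_avg"] = df.groupby("segment")["revenue"].transform("mean")

# Pivot table for a segment-by-month view
pivot = df.pivot_table(index="segment", columns="month",
                       values="revenue", aggfunc="sum")
```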
18. How do you usually prepare data for visualization?
First, I filter out unnecessary rows and focus on key features. I often recalculate metrics: percentages, averages, shares, ranks — things that are easier to read visually. I also add categorical columns to use as hue or color in plots. For time-based graphs, I always sort dates and fill in missing values to avoid broken lines. Another important step: I rename columns and values so that the chart is self-explanatory. I prepare data as if someone else — not me — will use it.
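One possible preparation sketch for a time-series chart, assuming hypothetical date and revenue columns:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file
df["date"] = pd.to_datetime(df["date"])

# Daily revenue, with missing days filled so the line doesn't break
daily = (
    df.set_index("date")
      .resample("D")["revenue"].sum()
      .fillna(0)
      .reset_index()
      .sort_values("date")
)

# Shares are often easier to read than raw sums
daily["revenue_share"] = daily["revenue"] / daily["revenue"].sum()

# Self-explanatory names for the final chart
daily = daily.rename(columns={"date": "Day", "revenue": "Revenue, $"})
```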
19. What common mistakes do you encounter when working with Pandas, and how do you avoid them?
The first is inconsistent data types during merge — for example, joining int64 and object key columns, so rows fail to match. I always check types beforehand. The second is working with a copy of a DataFrame: chained indexing can silently operate on a copy, and changes may not apply. To avoid SettingWithCopyWarning, I always assign through .loc[]. The third is ignoring missing values: functions like mean() and sum() skip NaN by default — but sometimes that matters. Another one — poorly managed indexes. I’m always attentive to DataFrame structure, especially after a groupby or join, when a MultiIndex can appear unexpectedly.
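A few hedged snippets showing how these pitfalls are typically avoided, with hypothetical column names:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")  # hypothetical files
users = pd.read_csv("users.csv")

# 1. Align key dtypes before merging
orders["user_id"] = orders["user_id"].astype(str)
users["user_id"] = users["user_id"].astype(str)
merged = orders.merge(users, on="user_id", how="left")

# 2. Assign through .loc on an explicit copy to avoid SettingWithCopyWarning
adults = merged[merged["age"] >= 18].copy()
adults.loc[:, "age_group"] = "adult"

# 3. NaN is skipped by default - make that choice explicit when it matters
print(merged["revenue"].sum())              # ignores NaN
print(merged["revenue"].sum(skipna=False))  # NaN if anything is missing
```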
20. Which Python tools beyond Pandas do you consider essential for a data analyst?
Besides Pandas, it’s important to know datetime, collections (especially Counter and defaultdict), and how to work with files using os and pathlib. Knowing how to write functions is a must — otherwise the code becomes unreadable. Regular expressions (re) are often lifesavers when cleaning data. I also use json frequently — especially when working with API responses or nested structures. Sometimes I add argparse or click when building command-line tools. And definitely matplotlib — you can draw anything in it, even when other libraries fall short. Python isn’t just about Pandas, and the broader your toolkit, the faster and cleaner you solve tasks.
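A small sketch combining several of these standard-library tools; the folder path and field names are hypothetical:

```python
import json
import re
from collections import Counter
from pathlib import Path

# Collect all export files in a folder
files = list(Path("exports").glob("*.json"))

events = []
for path in files:
    payload = json.loads(path.read_text(encoding="utf-8"))
    events.extend(payload.get("events", []))

# Count event types and clean a free-text field with a regular expression
type_counts = Counter(e.get("type", "unknown") for e in events)
clean_comments = [re.sub(r"\s+", " ", e.get("comment", "")).strip() for e in events]

print(type_counts.most_common(5))
```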