Most data science interview failures happen on the fundamentals, not the fancy algorithms. Hiring managers consistently report that candidates who falter aren't tripped up by transformer architectures; they mishandle a groupby, blank on what a p-value means, or write SQL that works but scans the entire table. This data science cheat sheet covers what you need to have internalized: Python, SQL, statistics, and ML basics, with enough detail to jog your memory and enough precision to be useful.
What This Data Science Cheat Sheet Covers
A cheat sheet only works if it's opinionated. This one covers the four areas that show up in virtually every data science role regardless of industry:
- Python (pandas, NumPy, scikit-learn): The lingua franca of data work
- SQL: Still the fastest way to answer most business questions
- Statistics & probability: The theory underneath every model you'll build
- Machine learning fundamentals: The concepts you need to use—and explain—correctly
If you find yourself fuzzy on entire sections, the course recommendations at the end of this page are worth your time. The goal isn't memorization—it's having a strong enough mental model that you can reconstruct the syntax without Googling every line.
Python Cheat Sheet for Data Science
pandas: The Operations You'll Use Daily
Most data work in pandas reduces to a handful of patterns. These are the ones worth having automatic:
- df.shape, df.dtypes, df.describe(): the first three things you run on any new dataset
- df.isnull().sum(): count missing values per column
- df.dropna() vs df.fillna(value): drop rows with nulls vs impute them
- df[df['col'] > threshold]: boolean filtering; chains cleanly with & and | when each condition is wrapped in parentheses
- df.groupby('col').agg({'val': ['mean', 'count']}): aggregate with multiple functions at once
- df.merge(other, on='key', how='left'): SQL-style joins; how accepts inner, left, right, outer
- df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum'): reshape for summary views
- df['col'].value_counts(normalize=True): frequency distribution as proportions
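To make these patterns concrete, here is a minimal sketch that runs them end to end. The two DataFrames (orders and managers) and their column names are invented for illustration; only the pandas calls themselves come from the list above.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
orders = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "product": ["a", "b", "a", "a", "b"],
    "sales": [100.0, 250.0, 80.0, None, 120.0],
})
managers = pd.DataFrame({
    "region": ["east", "west"],
    "manager": ["Ana", "Ben"],
})

# Count missing values per column, then impute with 0 instead of dropping rows.
missing = orders.isnull().sum()
orders["sales"] = orders["sales"].fillna(0)

# Boolean filtering: each condition needs its own parentheses when combined with &.
big_west = orders[(orders["region"] == "west") & (orders["sales"] > 100)]

# Aggregate one column with several functions at once.
summary = orders.groupby("region").agg({"sales": ["mean", "count"]})

# SQL-style left join on a shared key.
joined = orders.merge(managers, on="region", how="left")

# Reshape into a region-by-product summary table.
pivot = orders.pivot_table(values="sales", index="region", columns="product", aggfunc="sum")

# Frequency distribution as proportions.
shares = orders["region"].value_counts(normalize=True)

print(summary)
print(pivot)
```

Whether to impute with 0, a mean, or to drop the row depends on what a missing value means in your data; the fillna(0) here is only a placeholder choice.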
NumPy Essentials
- np.array([...]), np.zeros((m, n)), np.ones((m, n)), np.arange(start, stop, step): array constructors
- arr.reshape(rows, cols): change shape; returns a view without copying data when the memory layout allows it
- np.dot(A, B) or A @ B: matrix multiplication
- np.mean(arr, axis=0): column-wise mean; axis=1 for row-wise
- np.random.seed(42): set before any random operation for reproducibility
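A short sketch of the same calls on a tiny array, so the axis semantics and the view behavior of reshape are visible. The variable names are illustrative, not from any particular codebase.

```python
import numpy as np

values = np.arange(0, 6, 1)          # six consecutive integers starting at 0
grid = values.reshape(2, 3)          # 2 rows, 3 columns

# For a contiguous array like this one, reshape returns a view on the same memory.
is_view = np.shares_memory(values, grid)

col_means = np.mean(grid, axis=0)    # one mean per column
row_means = np.mean(grid, axis=1)    # one mean per row

A = np.ones((2, 3))
B = np.arange(3).reshape(3, 1)
product = A @ B                      # same result as np.dot(A, B) for 2-D arrays

# Resetting the seed reproduces the same random draws.
np.random.seed(42)
first_draw = np.random.rand(3)
np.random.seed(42)
second_draw = np.random.rand(3)
```

A handy way to remember the axis argument: axis=0 collapses the rows (leaving one value per column), axis=1 collapses the columns.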
scikit-learn: The Standard Pipeline
Nearly every supervised learning task in scikit-learn follows the same structure:
- Split: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
- Scale (if needed): scaler = StandardScaler(); X_train = scaler.fit_transform(X_train); X_test = scaler.transform(X_test)
- Fit: model.fit(X_train, y_train)
- Predict: y_pred = model.predict(X_test)
- Evaluate: accuracy_score(y_test, y_pred) or mean_squared_error(y_test, y_pred)
Call fit_transform on the training data and transform only on the test data, so the scaler's mean and standard deviation come from the training set alone. Leaking test statistics into your scaler is one of the most common mistakes beginners make.
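The five steps can be sketched as one runnable script. The synthetic dataset from make_classification and the choice of LogisticRegression are assumptions made for illustration; any scikit-learn estimator with fit and predict would slot into the same structure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# 1. Split before doing anything that learns from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Scale: statistics are learned from the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Fit, 4. predict, 5. evaluate.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print(f"test accuracy: {accuracy:.3f}")
```

Because the scaler was fit on the training rows only, the scaled training columns have mean zero while the scaled test columns generally do not, which is the expected behavior rather than a bug.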
SQL Cheat Sheet for Data Analysis
SQL fluency separates analysts who can answer a question in three minutes from those who take three hours. Focus on these patterns:
Query Structure and Execution Order
Write order: SELECT → FROM → JOIN → WHERE → GROUP BY → HAVING → ORDER BY → LIMIT
Execution order: FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
That distinction matters: you can't reference a column alias in WHERE because SELECT hasn't run yet. Use a subquery or CTE instead.
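As a small illustration of the CTE workaround, the sketch below computes an aliased column inside a CTE and filters on it in the outer query. It uses Python's built-in sqlite3 module with an in-memory database; the order_items table and its columns are hypothetical. Some engines are more lenient about aliases in WHERE than the standard requires, so the sketch shows only the portable form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table, invented for illustration.
    CREATE TABLE order_items (order_id INTEGER, price REAL, quantity INTEGER);
    INSERT INTO order_items VALUES
        (1, 20.0, 2),
        (2, 35.0, 4),
        (3, 60.0, 1),
        (4, 12.5, 10);
""")

# The alias line_total is defined in the CTE's SELECT, so by the time the
# outer WHERE runs it is an ordinary column of the CTE's result.
query = """
    WITH totals AS (
        SELECT order_id, price * quantity AS line_total
        FROM order_items
    )
    SELECT order_id, line_total
    FROM totals
    WHERE line_total > 100
    ORDER BY order_id
"""
large_orders = conn.execute(query).fetchall()
conn.close()

print(large_orders)
```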
Window Functions
Window functions are the most underused SQL feature and the clearest signal of SQL maturity in interviews:
- ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC): rank rows within groups
- LAG(revenue, 1) OVER (ORDER BY month): previous row value; useful for period-over-period comparisons
- SUM(revenue) OVER (PARTITION BY region ORDER BY date ROWS UNBOUNDED PRECEDING): running total
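The three window functions above can be tried together on a toy table. This sketch again uses Python's sqlite3 module and assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The sales table is hypothetical, and LAG is partitioned by region here so each region's first month has no previous value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical monthly revenue per region, invented for illustration.
    CREATE TABLE sales (region TEXT, month INTEGER, revenue INTEGER);
    INSERT INTO sales VALUES
        ('east', 1, 100), ('east', 2, 150), ('east', 3, 120),
        ('west', 1, 200), ('west', 2, 180);
""")

query = """
    SELECT
        region,
        month,
        revenue,
        -- 1 = most recent month within each region
        ROW_NUMBER() OVER (PARTITION BY region ORDER BY month DESC) AS recency_rank,
        -- previous month's revenue within the same region (NULL for the first month)
        LAG(revenue, 1) OVER (PARTITION BY region ORDER BY month) AS prev_revenue,
        -- cumulative revenue per region up to and including the current month
        SUM(revenue) OVER (
            PARTITION BY region ORDER BY month ROWS UNBOUNDED PRECEDING
        ) AS running_total
    FROM sales
    ORDER BY region, month
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

Filtering on recency_rank = 1 in an outer query is the usual way to pull the latest row per group.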
CTEs vs Subqueries
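CTEs (WITH cte AS (...)) are nearly always preferable to nested subqueries: they are easier to read and easier to debug, and most modern query planners treat them the same as an equivalent subquery. Some engines materialize CTEs (PostgreSQL did so by default before version 12), so check the query plan if performance matters. When you have more than one level of nesting, switch to a CTE.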
Statistics & Probability Reference
This is where most self-taught data scientists have the biggest gaps. You don't need to derive proofs from scratch, but you do need to know what these concepts actually mean:
Distributions You Need to Know
- Normal: Continuous, symmetric, defined by mean and standard deviation. The central limit theorem says the distribution of sample means approaches normal as sample size grows, regardless of the underlying distribution (provided it has finite variance).
- Binomial: Count of successes in n independent trials with probability p. Think: click-through rates, A/B test conversions.
- Poisson: Count of events in a fixed interval when events occur independently at a roughly constant average rate. Think: server errors per hour, customer arrivals per minute.
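For the two discrete distributions, the probability mass functions are short enough to write by hand. The sketch below uses only the standard library; the click-through and error-rate numbers are made up for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(exactly k events in an interval with average event count `rate`)."""
    return math.exp(-rate) * rate**k / math.factorial(k)


# Hypothetical scenario: 10 impressions at a 20% click-through rate.
p_three_clicks = binomial_pmf(3, 10, 0.2)

# Hypothetical scenario: a server averaging 2 errors per hour has an error-free hour.
p_zero_errors = poisson_pmf(0, 2.0)

# Sanity check: the binomial probabilities over all possible k sum to 1.
binomial_total = sum(binomial_pmf(k, 10, 0.2) for k in range(11))

print(round(p_three_clicks, 4), round(p_zero_errors, 4))
```

In practice scipy.stats provides these distributions ready-made; writing them out once is mostly useful for remembering what the parameters mean.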
Hypothesis Testing Cheat Sheet
- p-value: The probability of observing your result (or more extreme) if the null hypothesis were true. Not the probability the null is true.
- Type I error (α): False positive—rejecting a true null. You set this threshold (typically 0.05).
- Type II error (β): False negative—failing to reject a false null. Related to statistical power (1 − β).
- t-test: Compare means between two groups. Assumes normality (or large sample).
- Chi-squared test: Test independence between two categorical variables.
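One way to build intuition for the p-value definition is to compute one by simulation. The sketch below is a two-sided permutation test on the difference in means, swapped in for the t-test because it needs only the standard library and makes the "if the null were true" part explicit. The two samples are invented numbers, and the function name is hypothetical.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=42):
    """Two-sided permutation test for a difference in means.

    Under the null hypothesis the group labels are interchangeable, so we
    shuffle the labels many times and count how often the shuffled difference
    is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, invented for illustration.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
treatment = [13.0, 12.9, 13.4, 12.8, 13.1, 13.3, 12.7, 13.2]

p_different = permutation_p_value(control, treatment)
p_identical = permutation_p_value(control, control)

print(p_different, p_identical)
```

A small p_different says the observed gap would be rare if the labels did not matter; it does not say how likely the null hypothesis is.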
Correlation vs Causation
Pearson correlation measures linear association between two continuous variables (-1 to 1). It says nothing about causation and is misleading when the relationship is nonlinear. Spearman correlation handles monotonic (not just linear) relationships and is more robust to outliers. When presenting findings to stakeholders, be precise about which you're reporting and what it does and doesn't imply.
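The difference between the two coefficients shows up clearly on a relationship that is monotonic but not linear. The sketch below implements both by hand (Spearman as Pearson on average ranks) using only the standard library, with y = x cubed as an illustrative example.

```python
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


x = list(range(1, 11))
y = [value**3 for value in x]        # monotonic but strongly nonlinear

r_pearson = pearson(x, y)            # noticeably below 1
r_spearman = spearman(x, y)          # 1, because the ordering is preserved exactly

print(round(r_pearson, 3), round(r_spearman, 3))
```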
Machine Learning Fundamentals
Bias-Variance Tradeoff
High bias = underfitting (model too simple, misses patterns). High variance = overfitting (model memorizes training data, fails on new data). Your goal is the minimum point on the combined error curve. Techniques to reduce overfitting: regularization (L1/Lasso, L2/Ridge), cross-validation, dropout (neural nets), pruning (trees).
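A common way to see the tradeoff is to fit polynomials of increasing degree to noisy data and compare training and test error. The sketch below does this with NumPy; the sine-shaped signal, the noise level, and the degrees 1, 3, and 9 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_signal(x):
    # Hypothetical ground truth that the models are trying to recover.
    return np.sin(2 * np.pi * x)


x_train = np.linspace(0.0, 1.0, 20)
x_test = np.linspace(0.025, 0.975, 20)
y_train = true_signal(x_train) + rng.normal(0.0, 0.2, size=x_train.size)
y_test = true_signal(x_test) + rng.normal(0.0, 0.2, size=x_test.size)


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse


errors = {degree: fit_and_score(degree) for degree in (1, 3, 9)}

for degree, (train_mse, test_mse) in errors.items():
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error can only go down as the degree rises, so it is the test error that tells you where the model stops learning signal and starts fitting noise. The straight line (degree 1) underfits this curve badly on both sets.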
Evaluation Metrics Quick Reference
- Classification: Accuracy (misleading on imbalanced classes), Precision (of positive predictions, how many are right), Recall (of actual positives, how many did we catch), F1 (harmonic mean of precision/recall), AUC-ROC (discrimination ability across thresholds)
- Regression: MAE (mean absolute error—interpretable in original units), RMSE (penalizes large errors more heavily), R² (proportion of variance explained—can be negative for bad models)
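Most of these metrics are a few lines of arithmetic, and writing them out once makes the definitions stick. The sketch below uses plain Python with invented labels and predictions; in real work the equivalents in sklearn.metrics are the usual choice.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R squared for numeric targets."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": (ss_res / n) ** 0.5,
        "r2": 1 - ss_res / ss_tot,
    }


# Imbalanced labels: predicting "negative" for everyone still scores 80% accuracy.
labels = [0] * 8 + [1] * 2
always_negative = classification_metrics(labels, [0] * 10)
one_hit = classification_metrics(labels, [0] * 7 + [1, 1, 0])

# A reasonable regression fit, and one that is worse than predicting the mean.
targets = [3.0, 5.0, 7.0, 9.0]
decent = regression_metrics(targets, [2.0, 5.0, 8.0, 12.0])
reversed_fit = regression_metrics(targets, [9.0, 7.0, 5.0, 3.0])

print(always_negative, one_hit, decent, reversed_fit, sep="\n")
```

The always_negative case is the imbalanced-class trap in miniature: high accuracy, zero recall. The reversed_fit case shows how R squared drops below zero when a model does worse than a constant prediction.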
Cross-Validation
K-fold CV splits your data into k folds, trains on k-1, validates on the held-out fold, and rotates. The result is k validation scores you average. Use this instead of a single train/test split whenever your dataset is small enough to afford it. Stratified k-fold preserves class proportions—use it for classification.
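The rotation itself is easy to sketch at the index level. The example below is plain Python with a deliberately trivial "model" (predict the training mean) and invented target values; it does not shuffle or stratify, which a real workflow typically would via scikit-learn's KFold or StratifiedKFold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Fold sizes differ by at most one. No shuffling is done here.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Hypothetical targets; the stand-in model just predicts the training mean.
y = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]

fold_scores = []
for train_idx, val_idx in k_fold_indices(len(y), k=5):
    train_mean = sum(y[i] for i in train_idx) / len(train_idx)
    fold_mae = sum(abs(y[i] - train_mean) for i in val_idx) / len(val_idx)
    fold_scores.append(fold_mae)

# The cross-validated score is the average over the k held-out folds.
cv_score = sum(fold_scores) / len(fold_scores)

print(fold_scores, cv_score)
```

Every sample lands in exactly one validation fold, so each observation is used for validation once and for training k-1 times.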
Top Courses to Fill the Gaps
A cheat sheet gets you through a review session. Courses build the deeper understanding that makes the cheat sheet unnecessary. These are the highest-rated options currently available:
Introduction to Data Analytics
Covers the full analytics workflow from asking the right question to presenting findings—useful if your statistical fundamentals are shakier than your Python. The Coursera format lets you move fast through material you already know.
Tools for Data Science
Specifically covers the ecosystem: Jupyter, GitHub, Watson Studio, and the various languages data scientists actually use in practice. Good for filling toolchain gaps without sitting through a full beginner curriculum.
Python for Data Science, AI & Development (IBM)
IBM's course goes deeper on Python fundamentals than most—pandas, NumPy, and APIs with real datasets. If the Python section of this cheat sheet exposed gaps, start here.
Analyze Data to Answer Questions
Part of Google's data analytics certificate, this course is specifically about the analysis phase: calculating, aggregating, and interpreting data to answer business questions using spreadsheets and SQL.
Process Data from Dirty to Clean
Data cleaning is the unglamorous 60-80% of real data work. This course treats it seriously rather than as a footnote—worth it if you've ever inherited a dataset from a client or colleague.
Python Data Science (edX)
A more academic treatment that covers probability and statistics alongside Python implementation—good if you want the theory and practice in the same course rather than patching them together separately.
FAQ
What should a data science cheat sheet include?
At minimum: core Python operations (pandas, NumPy, scikit-learn patterns), essential SQL including window functions and CTEs, key statistical concepts (distributions, hypothesis testing, correlation), and ML fundamentals (bias-variance, evaluation metrics, cross-validation). The goal isn't exhaustive coverage—it's having the high-frequency, easy-to-blank-on material in one place.
Is Python or R better for data science?
Python for most roles. R remains dominant in academic statistics and certain biostatistics/pharma environments. If you're targeting industry data science or ML engineering roles, Python first is the right call—SQL second. R is worth learning afterward if your specific field uses it.
How much math do I need to know for data science?
Practically: statistics and probability are non-negotiable. Linear algebra matters more as you move toward ML engineering and deep learning. Calculus is useful for understanding gradient descent but rarely applied by hand. Start with statistics, learn linear algebra when you hit models that require it, and don't let math prerequisites stall you from getting started.
What's the fastest way to learn data science from scratch?
Pick one structured course that covers Python, SQL, and statistics together rather than three separate courses in sequence. Build one real project with a messy, real-world dataset (Kaggle has many). Apply for junior analyst or data analyst roles before you feel fully ready—the job teaches you faster than any course.
Do I need a degree to get a data science job?
Not for analyst-adjacent roles, which are the realistic entry point for most career changers. For research scientist roles at large tech companies, a relevant degree (statistics, CS, applied math) is a practical filter. The certificate-plus-portfolio route works well for data analyst, business analyst, and junior data scientist titles.
What's the difference between a data analyst and a data scientist?
Data analysts primarily work with existing data to answer defined business questions—heavy on SQL, dashboards, and reporting. Data scientists build predictive models and work on less-defined problems—heavier on Python, ML, and experiment design. The line blurs in practice, and many companies use the titles interchangeably. Both roles require SQL fluency.
Bottom Line
The data science cheat sheet above covers the material that actually comes up—in interviews, in day-to-day work, and in the moments when you know what you need to do but can't remember the exact syntax. Bookmark it, but don't stop there.
If the Python section exposed real gaps, the IBM Python for Data Science course is the most direct fix. If your SQL is weak, Analyze Data to Answer Questions treats SQL seriously as a tool for answering real questions. If statistics is your weak point (which is true for more people than will admit it), Introduction to Data Analytics covers the foundations without assuming a math background.
The skills listed here compound. Every project you complete makes the next one faster. Pick the most obvious gap, address it with a specific course, and move on.