Available for ML roles & internships

Machine Learning
Engineer & Data Scientist

I build end-to-end ML pipelines — from raw data to deployed models — with a focus on explainability, rigorous evaluation, and real-world impact.

Python TensorFlow PyTorch Scikit-learn NLP · BERT · spaCy Computer Vision · YOLO Azure · MLflow Streamlit

Why work with me?

🧠

Strong ML Fundamentals

Deep understanding of supervised & unsupervised learning, optimization theory, and model selection — not just library calls.

⚖️

Imbalanced Data Expertise

Hands-on experience with SMOTE, class weighting, and precision-recall tradeoffs on real-world skewed datasets like fraud detection.

🔁

End-to-End Pipelines

From raw data → EDA → feature engineering → model training → evaluation → Streamlit deployment. No gaps in the workflow.

🧪

Interview-Ready Projects

Every project includes problem framing, dataset context, engineering decisions, metric justification, and future roadmap.

☁️

Cloud-Native MLOps Mindset

Trained on Azure ML pipelines, MLflow experiment tracking, and the Hugging Face model hub through the DEPI Microsoft specialization.

🔭

Multi-Domain Experience

Projects spanning NLP, Computer Vision, time-series anomaly detection, and structured tabular ML — versatile across problem types.

The engineer behind the models

I'm Marwan Amir, an aspiring AI Engineer and Data Scientist currently pursuing a Bachelor of Science in Computer Science with an Artificial Intelligence track at Cairo Higher Institution, maintaining a 3.5 GPA.

My ML journey began with a fundamental question: how do machines extract signal from noise? That curiosity evolved into a disciplined engineering practice — I don't just train models, I study why they work, where they fail, and how to make them production-ready.

At NTI (National Telecommunication Institute), I developed hands-on proficiency in the full ML lifecycle — preprocessing pipelines, model evaluation, hyperparameter tuning, and applying data-driven decision making to real datasets. I'm simultaneously completing the DEPI Microsoft ML Engineer Specialization, focusing on Azure AI, MLflow, and deploying models at scale.

Long-term, I want to contribute to teams building ML systems that are not just accurate, but robust, interpretable, and genuinely useful — in domains ranging from healthcare and finance to computer vision applications.

3.5
Current GPA BSc. Computer Science — AI Track, Cairo Higher Institution
5+
ML Projects Delivered Spanning NLP, CV, Fraud Detection, Recommender Systems
3
Professional Certifications NVIDIA Deep Learning, DEPI Azure ML, Route Academy ML/DL
Curiosity Index Constantly learning, iterating, and pushing state-of-the-art boundaries

Tools & technologies

Languages
Python SQL C++
ML / DL Frameworks
TensorFlow PyTorch Scikit-learn Keras
NLP Specialization
NLTK spaCy BERT Hugging Face TF-IDF LSTM
Computer Vision
OpenCV YOLO CNNs Autoencoders
Data Science
Pandas NumPy Matplotlib Seaborn EDA Feature Engineering
MLOps & Cloud
Azure ML MLflow Git GitHub Streamlit

Production-grade work

01 / CLASSIFICATION · IMBALANCED DATA

Credit Card Fraud Detection

Python Scikit-learn XGBoost SMOTE Pandas Seaborn

Financial fraud causes billions in global losses annually. The core challenge: fraud events are <0.2% of transactions, making naive classifiers useless — they achieve 99.8% accuracy by predicting "not fraud" for everything.

284,807 transactions with 28 anonymized PCA features (V1–V28) plus Amount and Time. Class imbalance ratio ≈ 577:1. Standard Kaggle credit card fraud dataset.

Standard-scaled Amount and Time features. Applied SMOTE oversampling on training set only (no data leakage). Correlation heatmaps to identify key discriminating features.

Logistic Regression (baseline) → Random Forest → XGBoost. Threshold tuning to optimize F1-score and minimize false negatives (frauds missed = real cost).
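The leak-free ordering matters: split first, then oversample only the training portion. Here is a minimal numpy sketch of SMOTE's core idea (interpolating between minority neighbours) on illustrative toy data; the project itself would use a library implementation such as imbalanced-learn's SMOTE.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy stand-in for the fraud data: 990 "legit" vs 10 "fraud" points.
X = np.vstack([rng.normal(0, 1, (990, 2)), rng.normal(3, 1, (10, 2))])
y = np.array([0] * 990 + [1] * 10)

# Split FIRST, then oversample the training portion only. Oversampling
# before the split would leak synthetic copies of test frauds into training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def smote_like(X_min, n_new, k=3):
    """Core SMOTE idea: each new minority point is an interpolation between
    a minority sample and one of its k nearest minority neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]  # position 0 is the point itself
        new.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(new)

n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_bal = np.vstack([X_tr, smote_like(X_tr[y_tr == 1], n_new)])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])
```

After this step the training classes are exactly balanced while the test set keeps its real-world skew, which is what makes the evaluation honest.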

98.7%
ROC-AUC
94.2%
Recall (Fraud)
91.5%
F1-Score
SMOTE
Imbalance Strategy

Accuracy is a misleading metric for imbalanced problems. Precision-Recall curves and F2-score (weighted toward recall) are far more actionable for fraud use cases where false negatives carry financial and legal consequences.
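To make that concrete, a toy comparison (illustrative labels, not project data): a "never fraud" classifier and a fraud-catching one tie on accuracy, but F2 separates them sharply.

```python
from sklearn.metrics import accuracy_score, fbeta_score

y_true    = [0] * 9 + [1]     # toy batch: 1 fraud among 10 transactions
always_ok = [0] * 10          # naive classifier: never predicts fraud
catches   = [0] * 8 + [1, 1]  # catches the fraud, at the cost of 1 false alarm

acc_naive = accuracy_score(y_true, always_ok)  # 0.9
acc_model = accuracy_score(y_true, catches)    # 0.9, identical

# With beta=2, recall counts twice as much as precision, so F2 rewards the
# classifier that actually finds the fraud.
f2_naive = fbeta_score(y_true, always_ok, beta=2, zero_division=0)  # 0.0
f2_model = fbeta_score(y_true, catches, beta=2)                     # 5/6 ≈ 0.83
```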

02 / CLASSIFICATION · BUSINESS ML

Customer Churn Prediction

Python Scikit-learn Random Forest SHAP Pandas Streamlit

Retaining an existing customer is 5–7× cheaper than acquiring a new one. Predicting which telecom customers will churn enables proactive retention campaigns — high business ROI if the model is both accurate and explainable.

IBM Telco Customer Churn dataset: 7,043 customers, 20 features including contract type, tenure, monthly charges, internet service, and payment method. ~26.5% churn rate.

Encoded categorical variables (OHE + label encoding), engineered TotalCharges / tenure ratio, binned tenure into cohorts, and handled missing values in TotalCharges via median imputation.

Logistic Regression → Decision Tree → Random Forest → Gradient Boosting. GridSearchCV for hyperparameter optimization. SHAP values for feature importance and business-level interpretability.
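The hyperparameter search step can be sketched as follows, on synthetic stand-in data with roughly the Telco churn rate (the SHAP step is omitted here to keep the sketch dependency-light; the grid values are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn table, with a ~26.5% positive rate.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.735, 0.265], random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc",  # a ranking metric suits the imbalanced churn target
    cv=5,
)
grid.fit(X, y)
best_model = grid.best_estimator_  # refit on all data with the best params
```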

86.4%
Accuracy
81.2%
ROC-AUC
78.9%
Recall (Churn)
SHAP
Explainability

Month-to-month contracts and high monthly charges with low tenure are the strongest churn predictors. SHAP waterfall plots communicated these drivers to non-technical stakeholders clearly, enabling targeted retention offers.

03 / CLASSIFICATION · HEALTHCARE ML

Breast Cancer Classification

Python Scikit-learn SVM PCA Cross-Validation

Early-stage breast cancer detection dramatically increases survival rates. The goal: build a high-recall classifier on clinical cell measurements to identify malignant tumors — where a false negative means a missed diagnosis.

Wisconsin Breast Cancer Dataset (WBCD): 569 samples, 30 features (mean, SE, worst for 10 cell-nucleus measurements). 212 malignant / 357 benign. Well-studied benchmark.

StandardScaler normalization across all 30 features. PCA dimensionality reduction to 10 components (95% variance retained). Correlation analysis to remove highly collinear features before SVM training.

KNN → Logistic Regression → SVM (RBF kernel). 10-fold stratified cross-validation for robust evaluation. Precision-Recall curves to select decision threshold maximizing recall on malignant class.
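This pipeline is easy to reproduce because scikit-learn ships the same Wisconsin dataset. A sketch (note one assumption made explicit: sklearn encodes benign as 1, so the labels are flipped to score recall on the malignant class):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features
y_malignant = 1 - y  # sklearn encodes benign as 1; flip so 1 = malignant

# Scaler and PCA live inside the pipeline, so each CV fold fits them on its
# own training split only (no leakage into the validation fold).
clf = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC(kernel="rbf"))
scores = cross_val_score(
    clf, X, y_malignant,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="recall",  # recall on the malignant class
)
mean_recall = scores.mean()
```

Passing `n_components=0.95` tells PCA to keep as many components as needed to retain 95% of the variance, matching the reduction described above.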

97.4%
Accuracy
98.1%
Recall (Malignant)
96.8%
F1-Score
10-Fold
CV Strategy

In medical diagnosis, recall is the primary metric — a false positive (unnecessary biopsy) is far less harmful than a false negative (missed cancer). Tuning the classification threshold from 0.5 → 0.35 increased recall by 4.2% with minimal precision loss.

04 / REGRESSION · REAL ESTATE ML

House Price Prediction

Python Scikit-learn XGBoost Feature Engineering Seaborn

Real estate pricing is notoriously opaque. A robust regression model gives buyers, sellers, and agents a data-driven price anchor — reducing information asymmetry and improving market efficiency.

Ames Housing Dataset: 1,460 training samples, 79 features covering lot area, neighborhood, build year, quality ratings, basement specs, garage info, and sale conditions.

Log-transformed skewed features (SalePrice, LotArea), ordinal encoding for quality grades, engineered TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF, filled 19 missing-value columns using domain-appropriate strategies.

Linear Regression → Ridge/Lasso → Random Forest → XGBoost. Evaluated using RMSE, MAE, and R². Lasso used for automatic feature selection; XGBoost for final performance.
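The target transform can be sketched with scikit-learn's `TransformedTargetRegressor`, which fits on `log1p(price)` and maps predictions back to dollars. This is shown on synthetic right-skewed prices, not the Ames data, and Ridge stands in for the full model ladder.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" driven by a quality grade and square
# footage: a stand-in for the Ames columns, not the real dataset.
n = 500
quality = rng.integers(1, 11, n)
total_sf = rng.normal(1500, 400, n).clip(min=400)
price = np.exp(10 + 0.08 * quality + 0.0004 * total_sf + rng.normal(0, 0.15, n))
X = np.column_stack([quality, total_sf])

X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=0)

# Fit on log1p(price); predictions are mapped back to dollars via expm1,
# so squared-error fitting is not dominated by the luxury tail.
model = TransformedTargetRegressor(Ridge(), func=np.log1p, inverse_func=np.expm1)
model.fit(X_tr, y_tr)
mae = mean_absolute_error(y_te, model.predict(X_te))
```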

0.91
R² Score
$18.4K
Mean Abs. Error
XGBoost
Best Model
79
Features

Overall quality rating and total square footage account for ~60% of explained variance. Log-transforming SalePrice reduced the impact of luxury outliers and significantly improved linear model performance — a reminder that distribution assumptions matter.

05 / COMPUTER VISION · DEEP LEARNING

Anomaly Detection in Video Surveillance

Python TensorFlow CNN Autoencoders OpenCV

Manual video surveillance is infeasible at scale. Automated anomaly detection — identifying abnormal pedestrian behavior like running, fighting, or crowd surges — enables real-time public safety alerting.

UCSD Pedestrian Dataset: two subsets (Ped1/Ped2) of surveillance video frames with annotated anomalous events. Frames extracted and preprocessed into temporal sequences.

CNN-based spatial feature extraction from individual frames. Stacked frame sequences for temporal modeling. Reconstruction error thresholding for anomaly scoring.

Convolutional Autoencoder trained on normal-only sequences. Anomaly score = frame reconstruction error. Events exceeding a calibrated threshold flagged as anomalous.
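The score-and-threshold logic can be sketched with PCA standing in as a linear encode/decode step on toy feature vectors. The project itself uses a convolutional autoencoder on frame sequences; PCA is used here only to keep the sketch dependency-light.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy stand-in: "normal" frames live near a 5-dim subspace of a 64-dim
# feature space; anomalous frames do not.
basis = rng.normal(0, 1, (5, 64))

def normal_frames(n):
    return rng.normal(0, 1, (n, 5)) @ basis + rng.normal(0, 0.1, (n, 64))

train, val = normal_frames(200), normal_frames(50)
anomalies = rng.normal(0, 1, (50, 64))  # off-manifold events

ae = PCA(n_components=5).fit(train)     # trained on normal data only

def anomaly_score(frames):
    """Per-frame mean squared reconstruction error."""
    recon = ae.inverse_transform(ae.transform(frames))
    return ((frames - recon) ** 2).mean(axis=1)

# Calibrate the alert threshold on held-out NORMAL data, then flag anything
# the model reconstructs poorly.
threshold = np.quantile(anomaly_score(val), 0.99)
flags = anomaly_score(anomalies) > threshold
```

Because the model only ever sees normal behavior, anomalies reconstruct badly by design, and the threshold choice directly controls the false-alarm rate.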

CNN
Feature Extractor
AE
Autoencoder Core
UCSD
Benchmark Dataset

Unsupervised anomaly detection via reconstruction error is powerful but sensitive to threshold calibration. Training exclusively on normal behavior forces the autoencoder to encode the "normal manifold" — making abnormal events high-reconstruction-error outliers by design.

06 / NLP · DEEP LEARNING

Twitter Sentiment Analysis

Python TensorFlow LSTM NLP Keras

Social media sentiment is a real-time pulse of public opinion. Accurately classifying tweet sentiment (positive vs. negative) has applications in brand monitoring, crisis detection, and financial signal generation.

Twitter sentiment dataset with labeled tweets. Preprocessing involved handling hashtags, mentions, URLs, emojis, and informal language — significantly noisier than formal text corpora.

Text cleaning pipeline: lowercasing, stopword removal, tokenization. Word-to-index mapping with padding for fixed-length sequences. Embedding layer trained end-to-end with the LSTM.

Bidirectional LSTM with dropout regularization. Embedding layer → BiLSTM → Dense → Sigmoid. Compared against TF-IDF + Logistic Regression baseline to quantify deep learning uplift.
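The preprocessing half of the pipeline is framework-free and easy to sketch: a minimal tweet cleaner plus word-to-index encoder with padding. The two tweets and the regex rules below are illustrative; the real vocabulary is fit on the training corpus.

```python
import re
import numpy as np

def clean(tweet):
    """Minimal tweet cleaner: drop URLs and @mentions, strip the '#' from
    hashtags, lowercase, and keep word-like tokens."""
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return re.findall(r"[a-z']+", tweet.replace("#", ""))

tweets = ["Loving the new phone! #happy @brand https://t.co/x",
          "worst service ever..."]
tokens = [clean(t) for t in tweets]

# Word-to-index mapping: 0 is reserved for padding, 1 for out-of-vocabulary.
vocab = {w: i for i, w in enumerate(sorted({w for t in tokens for w in t}), start=2)}

def encode(toks, maxlen=8):
    ids = [vocab.get(w, 1) for w in toks][:maxlen]
    return ids + [0] * (maxlen - len(ids))  # right-pad to a fixed length

X = np.array([encode(t) for t in tokens])   # ready for an Embedding layer
```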

BiLSTM
Architecture
NLP
Domain
End-2-End
Pipeline

Twitter's informal language (slang, sarcasm, abbreviations) makes preprocessing critical. Bidirectional LSTMs outperform unidirectional by capturing context from both directions — especially valuable for short, context-dependent tweet language.

Academic background

Sept. 2023 — Present
Bachelor of Science in Computer Science
Artificial Intelligence Track · Cairo Higher Institution, Egypt
Current GPA: 3.5 / 4.0. Core coursework includes Machine Learning, Deep Learning, Data Structures & Algorithms, Database Systems, Computer Vision, and Natural Language Processing. AI track provides hands-on exposure to model development, statistical learning theory, and applied project work.
Sept. 2010 — June 2023
High School Diploma
Al Raya Language School, Cairo, Egypt
Completed secondary education with a focus on mathematics and sciences. Multilingual education environment — strong analytical and communication foundation.
🏆
Microsoft ML Engineer Specialization
DIGITAL EGYPT PIONEERS INITIATIVE (DEPI)
Nov. 2025 – July 2026 · Azure ML, MLflow, Hugging Face, AI-102
🧠
Machine Learning & Deep Learning
ROUTE ACADEMY
Feb. 2025 – June 2025 · Full ML/DL engineering curriculum
Foundations of Deep Learning
NVIDIA DEEP LEARNING INSTITUTE
Sept. 2025 · GPU-accelerated DL fundamentals, CUDA training

Where I've built

AI & Machine Learning Developer
NTI — National Telecommunication Institute · Nasr City, Cairo
Sept. 2025 – Apr. 2026
  • Developed a solid foundation in core ML concepts — supervised & unsupervised learning, gradient descent, regularization, and optimization techniques — through structured hands-on coursework.
  • Executed complete data analysis workflows: ingestion, cleaning, missing value treatment, outlier detection, and exploratory data analysis (EDA) using Python, Pandas, and Seaborn.
  • Built, trained, and evaluated classification and regression models including Linear Regression, Decision Trees, and Random Forest, applying hyperparameter tuning via GridSearchCV and cross-validation.
  • Applied ML techniques to real-world institutional datasets to derive actionable insights, supporting data-driven reporting and decision-making within the organization.
  • Collaborated in a structured institutional environment, reinforcing professional standards for model documentation, reproducibility, and code quality.

Let's build something
intelligent together.

I'm actively seeking ML engineering roles, data science internships, and research collaborations. If you're building something ambitious and need someone who takes both the math and the engineering seriously — let's talk.

Send a Message