I build end-to-end ML pipelines — from raw data to deployed models — with a focus on explainability, rigorous evaluation, and real-world impact.
Deep understanding of supervised & unsupervised learning, optimization theory, and model selection — not just library calls.
Hands-on experience with SMOTE, class weighting, and precision-recall tradeoffs on real-world imbalanced problems such as fraud detection.
From raw EDA → feature engineering → model training → evaluation → Streamlit deployment. No gaps in the workflow.
Every project includes problem framing, dataset context, engineering decisions, metric justification, and future roadmap.
Trained on Azure ML pipelines, MLflow experiment tracking, and the Hugging Face Model Hub through the DEPI Microsoft specialization.
Projects spanning NLP, Computer Vision, time-series anomaly detection, and structured tabular ML — versatile across problem types.
I'm Marwan Amir, an aspiring AI Engineer and Data Scientist currently pursuing a Bachelor of Science in Computer Science with an Artificial Intelligence track at Cairo Higher Institution, maintaining a 3.5 GPA.
My ML journey began with a fundamental question: how do machines extract signal from noise? That curiosity evolved into a disciplined engineering practice — I don't just train models, I study why they work, where they fail, and how to make them production-ready.
At NTI (National Telecommunication Institute), I developed hands-on proficiency in the full ML lifecycle — preprocessing pipelines, model evaluation, hyperparameter tuning, and applying data-driven decision making to real datasets. I'm simultaneously completing the DEPI Microsoft ML Engineer Specialization, focusing on Azure AI, MLflow, and deploying models at scale.
Long-term, I want to contribute to teams building ML systems that are not just accurate, but robust, interpretable, and genuinely useful — in domains ranging from healthcare and finance to computer vision applications.
Financial fraud causes billions in global losses annually. The core challenge: fraud events are <0.2% of transactions, making naive classifiers useless — they reach over 99.8% accuracy simply by predicting "not fraud" for everything.
284,807 transactions with 28 anonymized PCA features (V1–V28) plus Amount and Time. Class imbalance ratio ≈ 577:1. Standard Kaggle credit card fraud dataset.
Standard-scaled Amount and Time features. Applied SMOTE oversampling on training set only (no data leakage). Correlation heatmaps to identify key discriminating features.
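A minimal sketch of that leakage-free setup, assuming the standard Kaggle column names (Time, V1–V28, Amount, Class) and imbalanced-learn for SMOTE:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

df = pd.read_csv("creditcard.csv")  # standard Kaggle file name
X, y = df.drop(columns=["Class"]), df["Class"]

# Split first so the test set never contains synthetic samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train[["Amount", "Time"]] = scaler.fit_transform(X_train[["Amount", "Time"]])
X_test[["Amount", "Time"]] = scaler.transform(X_test[["Amount", "Time"]])

# SMOTE sees only training data, so no synthetic information leaks into evaluation.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```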
Logistic Regression (baseline) → Random Forest → XGBoost. Threshold tuning to optimize F1-score and minimize false negatives (frauds missed = real cost).
Accuracy is a misleading metric for imbalanced problems. Precision-Recall curves and F2-score (weighted toward recall) are far more actionable for fraud use-cases where false negatives carry financial and legal consequences.
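A sketch of that threshold sweep, continuing from the resampled split above and using a logistic regression baseline for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, precision_recall_curve

clf = LogisticRegression(max_iter=1000).fit(X_train_res, y_train_res)
probs = clf.predict_proba(X_test)[:, 1]

# Sweep the candidate thresholds from the PR curve and keep the one that
# maximizes F2, which weights recall twice as heavily as precision.
_, _, thresholds = precision_recall_curve(y_test, probs)
f2 = [fbeta_score(y_test, (probs >= t).astype(int), beta=2) for t in thresholds]
best_t = thresholds[int(np.argmax(f2))]
print(f"chosen threshold: {best_t:.3f}, F2 = {max(f2):.3f}")
```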
Retaining an existing customer is 5–7× cheaper than acquiring a new one. Predicting which telecom customers will churn enables proactive retention campaigns — high business ROI if the model is both accurate and explainable.
IBM Telco Customer Churn dataset: 7,043 customers, 20 features including contract type, tenure, monthly charges, internet service, and payment method. ~26.5% churn rate.
Encoded categorical variables (OHE + label encoding), engineered TotalCharges / tenure ratio, binned tenure into cohorts, and handled missing values in TotalCharges via median imputation.
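An illustrative slice of that preprocessing, assuming the published Telco column names (the engineered column names here are my own):

```python
import pandas as pd

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")  # standard Kaggle file name

# TotalCharges is stored as text and is blank for brand-new customers.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# Engineered ratio and tenure cohorts.
df["ChargesPerTenureMonth"] = df["TotalCharges"] / df["tenure"].clip(lower=1)
df["TenureCohort"] = pd.cut(
    df["tenure"], bins=[0, 12, 24, 48, 72],
    labels=["0-1y", "1-2y", "2-4y", "4-6y"], include_lowest=True
)

# One-hot encode multi-category columns; label-encode the binary target.
df = pd.get_dummies(df, columns=["Contract", "InternetService", "PaymentMethod", "TenureCohort"])
df["Churn"] = (df["Churn"] == "Yes").astype(int)
```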
Logistic Regression → Decision Tree → Random Forest → Gradient Boosting. GridSearchCV for hyperparameter optimization. SHAP values for feature importance and business-level interpretability.
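A compact version of the tuning-plus-explainability step, continuing from the frame above; the grid shown here is deliberately small and illustrative:

```python
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = pd.get_dummies(df.drop(columns=["Churn", "customerID"]))
y = df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

# SHAP turns the tuned tree model into per-customer explanations.
explainer = shap.TreeExplainer(grid.best_estimator_)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```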
Month-to-month contracts and high monthly charges with low tenure are the strongest churn predictors. SHAP waterfall plots communicated these drivers to non-technical stakeholders clearly, enabling targeted retention offers.
Early-stage breast cancer detection dramatically increases survival rates. The goal: build a high-recall classifier on clinical cell measurements to identify malignant tumors — where a false negative means a missed diagnosis.
Wisconsin Breast Cancer Dataset (WBCD): 569 samples, 30 features (mean, SE, worst for 10 cell-nucleus measurements). 212 malignant / 357 benign. Well-studied benchmark.
StandardScaler normalization across all 30 features. PCA dimensionality reduction to 10 components (95% variance retained). Correlation analysis to remove highly collinear features before SVM training.
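A minimal version of the scaling-plus-PCA step; WBCD ships with scikit-learn, so this sketch is fully reproducible:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale first (PCA is variance-driven), then keep enough components for ~95% variance.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)
print(X_reduced.shape)  # roughly (569, 10) on this dataset
```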
KNN → Logistic Regression → SVM (RBF kernel). 10-fold stratified cross-validation for robust evaluation. Precision-Recall curves to select decision threshold maximizing recall on malignant class.
In medical diagnosis, recall is the primary metric — a false positive (unnecessary biopsy) is far less harmful than a false negative (missed cancer). Tuning the classification threshold from 0.5 → 0.35 increased recall by 4.2% with minimal precision loss.
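A sketch of that threshold adjustment, continuing from the reduced features above. Note that in scikit-learn's encoding of WBCD, class 0 is malignant; the exact numbers depend on the split:

```python
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, stratify=y, random_state=42
)
svm = SVC(kernel="rbf", probability=True, random_state=42).fit(X_train, y_train)
malignant_prob = svm.predict_proba(X_test)[:, 0]  # column 0 = malignant class

for threshold in (0.5, 0.35):
    pred_malignant = (malignant_prob >= threshold).astype(int)
    rec = recall_score(y_test == 0, pred_malignant)
    prec = precision_score(y_test == 0, pred_malignant)
    print(f"threshold={threshold}: recall={rec:.3f}, precision={prec:.3f}")
```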
Real estate pricing is notoriously opaque. A robust regression model gives buyers, sellers, and agents a data-driven price anchor — reducing information asymmetry and improving market efficiency.
Ames Housing Dataset: 1,460 training samples, 79 features covering lot area, neighborhood, build year, quality ratings, basement specs, garage info, and sale conditions.
Log-transformed skewed features (SalePrice, LotArea), ordinal encoding for quality grades, engineered TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF, filled 19 missing-value columns using domain-appropriate strategies.
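An illustrative slice of that feature engineering, assuming the Kaggle Ames column names; the quality mapping shown covers only two of the graded columns:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")  # Kaggle House Prices training file

# Log-transform the heavily right-skewed target and lot size.
train["SalePrice"] = np.log1p(train["SalePrice"])
train["LotArea"] = np.log1p(train["LotArea"])

# Combined square-footage feature.
train["TotalSF"] = train["TotalBsmtSF"] + train["1stFlrSF"] + train["2ndFlrSF"]

# Ordinal encoding for quality grades; NaN in GarageQual means "no garage".
qual_map = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}
train["KitchenQual"] = train["KitchenQual"].map(qual_map)
train["GarageQual"] = train["GarageQual"].map(qual_map).fillna(0)
```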
Linear Regression → Ridge/Lasso → Random Forest → XGBoost. Evaluated using RMSE, MAE, and R². Lasso used for automatic feature selection; XGBoost for final performance.
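A condensed version of the evaluation loop, continuing from the engineered frame above and restricted to numeric columns for brevity:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

X = train.drop(columns=["SalePrice", "Id"]).select_dtypes("number").fillna(0)
y = train["SalePrice"]  # already log-transformed
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

models = {
    "Lasso": make_pipeline(StandardScaler(), LassoCV(cv=5)),
    "XGBoost": XGBRegressor(n_estimators=500, learning_rate=0.05),
}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    mae = mean_absolute_error(y_val, preds)
    print(f"{name}: RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2_score(y_val, preds):.4f}")
```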
Overall quality rating and total square footage account for ~60% of explained variance. Log-transforming SalePrice reduced the impact of luxury outliers and significantly improved linear model performance — a reminder that distribution assumptions matter.
Manual video surveillance is infeasible at scale. Automated anomaly detection — identifying abnormal pedestrian behavior like running, fighting, or crowd surges — enables real-time public safety alerting.
UCSD Pedestrian Dataset: two subsets (Ped1/Ped2) of surveillance video frames with annotated anomalous events. Frames extracted and preprocessed into temporal sequences.
CNN-based spatial feature extraction from individual frames. Stacked frame sequences for temporal modeling. Reconstruction error thresholding for anomaly scoring.
Convolutional Autoencoder trained on normal-only sequences. Anomaly score = frame reconstruction error. Events exceeding a calibrated threshold flagged as anomalous.
Unsupervised anomaly detection via reconstruction error is powerful but sensitive to threshold calibration. Training exclusively on normal behavior forces the autoencoder to encode the "normal manifold" — making abnormal events high-reconstruction-error outliers by design.
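A compact Keras sketch of the idea, scoring single frames rather than full temporal stacks to keep it short; frame sizes, layer widths, and the percentile threshold are all illustrative:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_conv_autoencoder(input_shape=(128, 128, 1)):
    inputs = layers.Input(shape=input_shape)
    # Encoder: compress the frame toward the "normal manifold".
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(x)
    # Decoder: reconstruct the frame from the compressed representation.
    x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

def anomaly_scores(model, frames):
    """Per-frame reconstruction error; higher error = more anomalous."""
    reconstructed = model.predict(frames, verbose=0)
    return np.mean((frames - reconstructed) ** 2, axis=(1, 2, 3))

autoencoder = build_conv_autoencoder()
# Trained only on frames containing normal behavior, scaled to [0, 1]:
# autoencoder.fit(normal_frames, normal_frames, epochs=20, batch_size=32)
# threshold = np.percentile(anomaly_scores(autoencoder, normal_frames), 99)
# flags = anomaly_scores(autoencoder, test_frames) > threshold
```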
Social media sentiment is a real-time pulse of public opinion. Accurately classifying tweet sentiment (positive/negative/neutral) has applications in brand monitoring, crisis detection, and financial signal generation.
Twitter sentiment dataset with labeled tweets. Preprocessing involved handling hashtags, mentions, URLs, emojis, and informal language — significantly noisier than formal text corpora.
Text cleaning pipeline: lowercasing, stopword removal, tokenization. Word-to-index mapping with padding for fixed-length sequences. Embedding layer trained end-to-end with the LSTM.
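An illustrative cleaning and tokenization step; the regex choices, vocabulary size, and sequence length are assumptions, and stopword removal is omitted here for brevity:

```python
import re
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # strip URLs
    text = re.sub(r"@\w+", "", text)                   # strip mentions
    text = text.replace("#", "")                       # keep hashtag words, drop the symbol
    return re.sub(r"[^a-z\s]", "", text).strip()

tweets = ["Loving the new update!!! #happy", "@airline worst service ever http://t.co/xyz"]
cleaned = [clean_tweet(t) for t in tweets]

tokenizer = Tokenizer(num_words=20_000, oov_token="<unk>")
tokenizer.fit_on_texts(cleaned)
padded = pad_sequences(tokenizer.texts_to_sequences(cleaned), maxlen=60, padding="post")
```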
Bidirectional LSTM with dropout regularization. Embedding layer → BiLSTM → Dense → Sigmoid. Compared against TF-IDF + Logistic Regression baseline to quantify deep learning uplift.
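A sketch of the described architecture; the sigmoid output implies a binary positive/negative setup, and the layer widths are illustrative:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 20_000, 60, 128

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # trained end-to-end with the LSTM
    layers.Bidirectional(layers.LSTM(64)),     # reads the tweet in both directions
    layers.Dropout(0.5),                       # regularization against overfitting
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```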
Twitter's informal language (slang, sarcasm, abbreviations) makes preprocessing critical. Bidirectional LSTMs outperform unidirectional by capturing context from both directions — especially valuable for short, context-dependent tweet language.
I'm actively seeking ML engineering roles, data science internships, and research collaborations. If you're building something ambitious and need someone who takes both the math and the engineering seriously — let's talk.