Chirag Lamsal

data, stats, and code that ships.

Houston, TX

est. somewhere in Nepal

me, allegedly

Hi! I'm a data analyst and software engineer specializing in statistics and machine learning. Currently an M.S. student at Sam Houston State University.

chapter one

Experience

Graduate TA

Sam Houston State University

Sep '25 →

Teaching lab and tutoring sessions for stats and data science. Helping students work through R, Python, and statistical modeling: regression, ANOVA, the usual.

Software Engineer

Sandbox Software

'23 – '24

Shipped payment integrations that bumped platform revenue by 29%. Built a real-time auction marketplace (+11% engagement) and ran AWS / Docker / Cloudflare deploys at 99.8% uptime.

+29%

metrics #001

The Poster

STAT #2026

Abstractor · arXiv Classifier

An end-to-end NLP pipeline that classifies a paper’s title and abstract into one of 8 academic domains. 61,640 papers fetched from the arXiv API; 8 classifiers compared on macro-F1.

0.731

macro-F1

61,640

papers

8

classifiers

Model leaderboard

LR-L2 · TF-IDF

0.731

LinearSVC · TF-IDF

0.727

LinearSVC · SVD

0.724

LR-L2 · SVD

0.722

Multinomial NB

0.705

XGBoost · SVD

0.704

Random Forest

0.618

winner → Ridge logistic on TF-IDF, trained in 4.1s.

next time → fit the vectorizer on the training split only (the current run has mild leakage); try SciBERT to reduce the CS↔EE confusion.
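
A minimal sketch of the train-only fix, assuming whitespace tokenization and a smoothed IDF formula chosen for illustration; `fit_idf`, `transform`, and the toy abstracts are hypothetical and are not the project's code. The point is that document frequencies are learned from the training split alone, so the test split cannot influence the weights. With scikit-learn, placing the vectorizer inside a `Pipeline` gives the same guarantee during cross-validation.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (an illustrative choice).
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def transform(doc, idf):
    """Map a document to a sparse TF-IDF dict; terms unseen in training are dropped."""
    counts = Counter(doc.lower().split())
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


# Toy, made-up abstracts (not taken from the arXiv corpus).
train_docs = [
    "graph neural network",
    "neural network pruning",
    "convex optimization bounds",
]
test_docs = ["quantum network bounds"]

idf = fit_idf(train_docs)  # statistics come from the training split only
test_vectors = [transform(doc, idf) for doc in test_docs]
```

Here "quantum" never appears in training, so it is dropped from the test vector instead of leaking into the vocabulary.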

Preprocessing: LaTeX normalization (recovers \beta as beta), scispaCy lemmatization, corpus-based stopwords, and gensim phrase detection for terms like neural_network.
Validation: Before modeling, ran ~280K chi-square tests with BH correction, plus PERMANOVA on TF-IDF cosine distances. Pseudo-F = 6.601, p = 0.001 → classes are meaningfully separable.
Stack: Python · scikit-learn · scispaCy · gensim · XGBoost · Streamlit
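
For the BH step in the validation above, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. The function name, the `alpha=0.05` default, and the p-values are assumptions for illustration; the project may well have used a library routine instead.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Made-up p-values, purely for illustration.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
flags = benjamini_hochberg(p_values, alpha=0.05)
```

Only the first two p-values clear their rank-scaled thresholds here, so only those two tests are flagged.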

✓ filed under: nlp · classification · tf-idf

file #001

Case File

SUBJECT #2026

The Vitamin D Mystery

An investigation into the association between serum vitamin D and diabetes risk, drawing on NHANES survey data.

Hypothesis: Serum vitamin D levels are inversely associated with diabetes prevalence across demographics.
Method: Weighted logistic regression accounting for complex survey design. Confounding adjustment for BMI, age, and physical activity.
Instruments: R · ggplot2 · survey pkg · ANOVA
Findings: Evaluated via AUC / ROC with confidence intervals. Visualized factor relationships across subgroups.
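
To make the AUC in the findings concrete, here is a minimal sketch that computes it as the share of (case, non-case) pairs the model ranks correctly, with optional per-respondent weights. Applying survey weights to the AUC is an assumption made for illustration, not something the case file states; the function name and the numbers are hypothetical, and the analysis itself was done in R.

```python
def weighted_auc(scores, labels, weights=None):
    """AUC as the weighted share of (positive, negative) pairs ranked correctly.

    Tied scores count as half-correct. Runs in O(n_pos * n_neg) time,
    which is fine for a sketch but slow for large samples.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    pos = [(s, w) for s, y, w in zip(scores, labels, weights) if y == 1]
    neg = [(s, w) for s, y, w in zip(scores, labels, weights) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    total = 0.0
    for score_pos, w_pos in pos:
        for score_neg, w_neg in neg:
            pair_weight = w_pos * w_neg
            total += pair_weight
            if score_pos > score_neg:
                concordant += pair_weight
            elif score_pos == score_neg:
                concordant += 0.5 * pair_weight
    return concordant / total


# Made-up predicted risks and outcomes, purely for illustration.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
unweighted = weighted_auc(scores, labels)
weighted = weighted_auc(scores, labels, weights=[1.0, 3.0, 1.0, 1.0])
```

Up-weighting the one non-case that outranks a case pulls the estimate down, which is the kind of shift a weighted design can produce.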

✓ filed under: health-data · epidemiology · weighted-models

drawing

Blueprint

DRAWING #2024

AssistAI · RAG Pipeline

A retrieval-augmented chatbot that answers questions grounded in a user’s own document knowledge base.

Retrieval: Documents chunked & embedded into a vector store (FAISS/Chroma); queries resolved via semantic search.
Generation: Retrieved context injected into the prompt so the model answers from source material, not just priors.
Service: FastAPI endpoints expose chat + ingest; streaming responses wired to the client.
Stack: Python · LangChain · FastAPI · FAISS
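
The retrieval and generation steps above can be sketched without any framework. This is an illustrative toy, not AssistAI's code: `embed` is a bag-of-words stand-in for a real embedding model, a sorted list stands in for the vector store, and the chunks, prompt wording, and function names are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Inject the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Made-up knowledge-base chunks, purely for illustration.
chunks = [
    "refunds are processed within five business days",
    "the office is closed on public holidays",
    "shipping is free for orders over fifty dollars",
]
question = "how long do refunds take"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
```

In a real deployment the prompt would be sent to the language model; swapping `embed` for learned embeddings and the sort for a FAISS or Chroma index keeps the same shape.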

SHEET 1 OF 1 · SCALE = 1 : 1 · DRAWN BY: C

the arsenal

Skills

The Lab

where numbers get interrogated

R · Python · SAS · SQL · C++ · Statistics · Scikit-learn · XGBoost

The Forge

where things get shipped

AWS · Docker · Django · PostgreSQL · Git · REST APIs · Nginx · Celery

say hi

Find Me