Chirag Lamsal

data, stats, and code that ships.

Houston, TX
est. somewhere in Nepal
Hi! I'm a data analyst and software engineer specializing in statistics and machine learning. Currently an M.S. student at Sam Houston State University.

Experience

Graduate TA

Sam Houston State University

Sep '25 →

Teaching lab and tutoring sessions for stats and data science. Helping students work through R, Python, and statistical modeling: regression, ANOVA, the usual.

Software Engineer

Sandbox Software

'23 – '24

Shipped payment integrations that bumped platform revenue by 29%. Built a real-time auction marketplace (+11% engagement) and ran AWS / Docker / Cloudflare deploys at 99.8% uptime.

+29%

The Poster

STAT #2026

Abstractor · arXiv Classifier

An end-to-end NLP pipeline that classifies a paper’s title and abstract into one of 8 academic domains. 61,640 papers fetched from the arXiv API; 8 classifiers compared on macro-F1.

0.731 macro-F1 · 61,640 papers · 8 classifiers

Model leaderboard

LR-L2 · TF-IDF        0.731
LinearSVC · TF-IDF    0.727
LinearSVC · SVD       0.724
LR-L2 · SVD           0.722
Multinomial NB        0.705
XGBoost · SVD         0.704
Random Forest         0.618

winner → L2-regularized (ridge) logistic regression on TF-IDF, trained in 4.1s.
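
For reference, macro-F1 is the plain average of per-class F1 scores, so a small domain counts as much as a large one. A minimal sketch in pure Python; the labels in the usage line are made up for illustration and this is not the project's evaluation code:

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Illustrative labels only
score = macro_f1(["cs", "cs", "ee", "math"], ["cs", "ee", "ee", "math"])
```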

next time → fit the vectorizer on the training split only (the current run has mild leakage); try SciBERT to reduce the CS↔EE confusion.
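
The leakage fix comes down to learning the vocabulary and IDF weights from training documents only, then reusing them unchanged on held-out text. A toy sketch in pure Python, assuming whitespace tokenization and a smoothed IDF; the function names and sample documents are hypothetical, and in scikit-learn the same effect comes from putting the vectorizer and the classifier in one `Pipeline` that is fitted on the training split:

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn vocabulary and smoothed IDF weights from training documents only."""
    n = len(train_docs)
    df = Counter()
    for doc in train_docs:
        df.update(set(doc.lower().split()))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def transform(doc, idf):
    """Map one document to TF-IDF weights; terms unseen in training are dropped."""
    tf = Counter(t for t in doc.lower().split() if t in idf)
    return {term: freq * idf[term] for term, freq in tf.items()}

# Hypothetical documents for illustration
train = ["graph neural network", "neural network pruning"]
test = ["quantum network"]

idf = fit_idf(train)  # fitted on the training split only
test_vectors = [transform(d, idf) for d in test]
```

Because `quantum` never appears in the training split, it carries no weight at test time, which is the behavior a leak-free evaluation should have.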
Preprocessing
LaTeX normalization (recovers \beta as beta), scispaCy lemmatization, corpus-based stopwords, and gensim phrase detection for terms like neural_network.
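
A rough sketch of what the LaTeX normalization step might look like, assuming a regex-only approach that handles `\command` tokens, inline-math dollars, and braces. The rule set here is hypothetical and much smaller than anything a real pipeline would need:

```python
import re

# Hypothetical rule: a backslash followed by letters is a LaTeX command.
_COMMAND = re.compile(r"\\([A-Za-z]+)")

def normalize_latex(text):
    """Keep the command name as a plain word, then drop math delimiters and braces."""
    text = _COMMAND.sub(r"\1", text)
    text = re.sub(r"[${}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = normalize_latex(r"the $\beta$ coefficient")
```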
Validation
Before modeling: ~280K chi-square tests with Benjamini-Hochberg (BH) correction, plus PERMANOVA on TF-IDF cosine distances. Pseudo-F = 6.601, p = 0.001 → the classes differ significantly, so they should be separable.
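
With that many tests, the BH step-up rule is what keeps the false discovery rate in check: sort the p-values, find the largest rank k with p(k) ≤ (k / m) · α, and reject everything up to that rank. A minimal sketch in pure Python; the α of 0.05 is an assumption (the level used in the project is not stated here) and the p-values are made up:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per p-value using the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / m) * alpha
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject

# Illustrative p-values only
flags = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
```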
Stack
Python · scikit-learn · scispaCy · gensim · XGBoost · Streamlit
filed under: nlp · classification · tf-idf

Case File

SUBJECT #2026

The Vitamin D Mystery

An investigation into the association between serum vitamin D and diabetes risk, drawing on NHANES survey data.

Hypothesis
Serum vitamin D levels are inversely associated with diabetes prevalence across demographic groups.
Method
Weighted logistic regression accounting for complex survey design. Confounding adjustment for BMI, age, and physical activity.
Instruments
R · ggplot2 · survey pkg · ANOVA
Findings
Evaluated via AUC / ROC with confidence intervals. Visualized factor relationships across subgroups.
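
As a reminder of what the AUC measures: it is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one, with ties counting half. A small unweighted sketch in Python (the project itself is in R, and this ignores survey weights and confidence intervals); the scores are toy values:

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores for illustration
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```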
filed under: health-data · epidemiology · weighted-models

Blueprint

DRAWING #2024

AssistAI · RAG Pipeline

A retrieval-augmented chatbot that answers questions grounded in a user’s own document knowledge base.

User Query → Embed + Search (Vector DB) → LLM + Context → Answer
Retrieval
Documents chunked & embedded into a vector store (FAISS/Chroma); queries resolved via semantic search.
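
A toy version of the retrieval step, to show the shape of it: split documents into overlapping word windows, embed each one, and rank by cosine similarity to the query. The bag-of-words `embed` below is a hypothetical stand-in for a real embedding model, and the sorted list stands in for a FAISS or Chroma index; chunk sizes and sample text are made up:

```python
import math
from collections import Counter

def chunk(text, size=8, overlap=2):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Hypothetical knowledge-base snippets
docs = [
    "refunds are issued within 14 days",
    "shipping takes 3 to 5 days",
    "contact support by email",
]
best = top_k("how long do refunds take", docs, k=1)
```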
Generation
Retrieved context injected into the prompt so the model answers from source material, not just priors.
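
Context injection can be as simple as a template that puts numbered passages ahead of the question. A sketch with hypothetical wording; the actual prompt used in AssistAI is not shown here:

```python
def build_prompt(question, contexts):
    """Place numbered source passages ahead of the question."""
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical question and passage
prompt = build_prompt("When are refunds issued?", ["Refunds are issued within 14 days."])
```

Numbering the passages also makes it easy to ask the model to cite which source it used.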
Service
FastAPI endpoints expose chat + ingest; streaming responses wired to the client.
Stack
Python · LangChain · FastAPI · FAISS
SHEET 1 OF 1 · SCALE = 1 : 1 · DRAWN BY: C

Skills

The Lab

where numbers get interrogated

R · Python · SAS · SQL · C++ · Statistics · Scikit-learn · XGBoost

The Forge

where things get shipped

AWS · Docker · Django · PostgreSQL · Git · REST APIs · Nginx · Celery

Find Me