Course Overview
Introduction to Natural Language Processing introduces students to computational methods for representing, analyzing, and generating human language. The course starts with classical statistical NLP and gradually moves toward modern neural methods, transformers, and large language models. It repeatedly emphasizes Turkish NLP, low-resource language challenges, reproducible experimentation, and practical model evaluation.
In recent offerings, the project component has evolved toward evaluating and fine-tuning small open-source LLMs for Turkish tasks. Students work with public datasets, convert task data into instruction-style formats, run baseline evaluations, apply supervised fine-tuning, and analyze whether the adapted model improves over the base model.
Learning Outcomes
Conceptual Understanding
- Explain the main ideas behind statistical NLP, corpus-based analysis, language modeling, embeddings, and neural sequence models.
- Compare rule-based, statistical, and neural approaches to language processing.
- Understand why Turkish poses distinctive modeling challenges: rich agglutinative morphology, data sparsity, and limited resource availability.
Technical Skills
- Prepare text data, tokenize it, represent it numerically, and train or evaluate NLP models.
- Compute and interpret metrics such as accuracy, F1, cross-entropy, perplexity, and baseline-vs-fine-tuned performance.
- Use reproducible workflows with Python, GitHub, Hugging Face, Google Colab, and modern fine-tuning tools.
Core Topic Map
The exact order may vary, but the course normally covers the following themes.
1. Introduction and Linguistic Foundations
What NLP is, rationalist vs. empiricist traditions, ambiguity, probability in language, Zipf's Law, corpus-based thinking, morphology, parts of speech, syntactic ambiguity, semantic roles, and basic linguistic terminology.
2. Text Processing and Corpus-Based Work
Regular expressions, tokenization, sentence segmentation, stemming, lemmatization, edit distance, spelling correction, corpus design, word classes, and practical preprocessing issues.
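As an illustration of the edit-distance topic listed above, the following is a minimal sketch of Levenshtein distance using dynamic programming. It assumes unit costs for insertion, deletion, and substitution; some textbooks use a substitution cost of 2 instead, which changes the numbers. The function name and example words are chosen here for illustration only.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit costs for insert, delete, and substitute."""
    # previous[j] = distance between the source prefix processed so far and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i source characters yields the empty target prefix
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitap", "kitaplar"))       # 3 (three insertions)
print(edit_distance("intention", "execution"))  # 5 with unit substitution cost
```

Keeping only two rows of the table is enough to obtain the distance; recovering the alignment itself would require storing the full table or back-pointers.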
3. Collocations and Statistical Association
Collocations, idioms, frequency-based extraction, part-of-speech filtering, hypothesis testing, t-test, chi-square test, mutual information, and limitations of low-frequency evidence.
4. N-gram Language Models
Chain rule, Markov assumption, unigram/bigram/trigram models, maximum likelihood estimation, sparse data, smoothing, back-off, interpolation, deleted interpolation, perplexity, and language model evaluation.
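The n-gram ideas above can be sketched on a toy corpus. The example below estimates bigram probabilities with maximum likelihood (k = 0) or add-k smoothing and computes sentence perplexity. The three-sentence corpus and all names are hypothetical and far too small for real estimates; the sketch is only meant to show the mechanics of a typical hand calculation.

```python
import math
from collections import Counter

# Hypothetical toy corpus with sentence-boundary markers.
corpus = [
    ["<s>", "ben", "eve", "gittim", "</s>"],
    ["<s>", "ben", "okula", "gittim", "</s>"],
    ["<s>", "sen", "eve", "gittin", "</s>"],
]

unigram_counts = Counter(tok for sent in corpus for tok in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
vocab_size = len(unigram_counts)


def bigram_prob(prev: str, word: str, k: float = 1.0) -> float:
    """Add-k estimate of P(word | prev); k=0 gives the unsmoothed MLE."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)


def perplexity(sentence: list[str], k: float = 1.0) -> float:
    """Perplexity over the predicted tokens (every token after <s>)."""
    # With k=0 an unseen bigram has probability 0 and math.log raises ValueError,
    # which is exactly the sparse-data problem that smoothing addresses.
    log_prob = sum(math.log(bigram_prob(p, w, k)) for p, w in zip(sentence, sentence[1:]))
    return math.exp(-log_prob / (len(sentence) - 1))


print(bigram_prob("<s>", "ben", k=0))  # MLE: 2 of the 3 sentences start with "ben"
print(bigram_prob("<s>", "ben"))       # add-one: (2 + 1) / (3 + 8)
print(perplexity(["<s>", "sen", "okula", "gittim", "</s>"]))
```

Add-one smoothing is used here only because it is easy to compute by hand; back-off and interpolation follow the same counting setup but combine estimates of different orders.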
5. Text Categorization and Classical Machine Learning
Naive Bayes, logistic regression, sentiment analysis, text classification pipelines, bag-of-words representations, feature weighting, regularization, and evaluation of classification models.
6. Vector Semantics and Embeddings
Distributional hypothesis, sparse and dense vectors, tf-idf, PMI, PPMI, cosine similarity, document centroids, word2vec, skip-gram with negative sampling, fastText, and intrinsic/extrinsic evaluation.
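Two of the quantities named above, cosine similarity and PPMI, can be sketched in a few lines. The vectors and co-occurrence counts below are made-up illustrative values, and the function names are assumptions of this sketch rather than part of any course material.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ppmi(count_wc: int, count_w: int, count_c: int, total: int) -> float:
    """Positive PMI: max(0, log2(P(w, c) / (P(w) * P(c)))) from raw counts."""
    if count_wc == 0:
        return 0.0  # unseen pairs are clipped to zero
    pmi = math.log2((count_wc * total) / (count_w * count_c))
    return max(0.0, pmi)


short = [1.0, 2.0, 3.0]
long = [10.0, 20.0, 30.0]  # same direction, ten times the length
print(sum(a * b for a, b in zip(short, long)))  # raw dot product grows with length
print(cosine(short, long))                      # cosine stays close to 1
print(ppmi(count_wc=10, count_w=20, count_c=50, total=1000))
```

The first two printed values show why length normalization matters: frequent words tend to have longer count vectors, so a raw dot product rewards frequency rather than similarity of direction.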
7. Neural Language Models
Feed-forward neural networks, activations, softmax, gradients, cross-entropy, multi-layer networks, neural language models, RNNs, LSTMs, bidirectionality, attention, encoder-decoder models, and sequence labeling vs. sequence classification.
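The output-layer calculations named above can be illustrated with a small sketch of sigmoid, softmax, cross-entropy for a single gold class, and the gradient of that loss with respect to the logits. The logits are arbitrary example numbers, and the function names are assumptions of this sketch.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits: list[float]) -> list[float]:
    """Softmax with the maximum subtracted first for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: list[float], gold_index: int) -> float:
    """Negative log probability assigned to the correct class."""
    return -math.log(probs[gold_index])


def softmax_ce_gradient(logits: list[float], gold_index: int) -> list[float]:
    """Gradient of cross-entropy w.r.t. the logits: predicted probs minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == gold_index else 0.0) for i, p in enumerate(probs)]


logits = [2.0, 1.0, 0.1]  # arbitrary example scores for three classes
probs = softmax(logits)
print(probs)
print(cross_entropy(probs, gold_index=0))
print(softmax_ce_gradient(logits, gold_index=0))
```

The gradient entries sum to zero, and the entry for the gold class is negative, so a gradient step raises the logit of the correct class and lowers the others.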
8. Transformers and Large Language Models
Self-attention, masked language modeling, causal language modeling, transformer architecture, pretraining, prompting, in-context learning, instruction tuning, alignment, and the role of retrieval and evaluation in modern LLM systems.
9. Turkish NLP and Applied LLM Projects
Turkish datasets, instruction-data construction, supervised fine-tuning, parameter-efficient fine-tuning, model comparison, benchmark evaluation, error analysis, reproducibility, model cards, and dataset cards.
Recent Reading Structure
Recent offerings have combined classical statistical NLP readings with the latest public draft of Speech and Language Processing. Older classroom announcements used the 2025 chapter numbering; the public 2026 draft reorganizes the LLM-related chapters, so this page lists readings by topic name rather than relying on chapter numbers alone.
Classical Statistical NLP
- Manning & Schütze, Foundations of Statistical Natural Language Processing: Introduction.
- Linguistic essentials and corpus-based work.
- Collocations and statistical association tests.
- N-gram models, sparse data, smoothing, back-off, and interpolation.
Modern NLP and LLMs
- Jurafsky & Martin, Speech and Language Processing: words and tokens, n-gram models, text classification, embeddings, neural networks, LLMs, transformers, post-training, masked language models, RAG, machine translation, and RNNs/LSTMs.
- Selected papers and technical notes on word2vec, fastText-style efficient text classification, Turkish LLM evaluation, and parameter-efficient fine-tuning.
Assessment and Course Work
Assessment varies by semester. A recent grading pattern used quizzes, a midterm exam, a term project, and a final exam. One recent example was: quizzes 15%, midterm exam 30%, project 15%, and final exam 40%.
Quizzes
Reading-based checks and short technical questions.
Midterm
Statistical NLP foundations, classification, embeddings, and calculations.
Project
Team-based NLP or LLM experimentation with reproducible deliverables.
Final
Broader coverage including neural models, transformers, LLMs, and project-related concepts.
Exam Preparation Themes
Students are expected to understand concepts and perform manageable hand calculations. A scientific calculator is useful for computing sigmoid and softmax values, gradients, log probabilities, and cross-entropy.
Conceptual Questions
- Explain why sparse data is a major problem in n-gram language modeling.
- Compare rule-based tokenization, BPE-style subword tokenization, and morphological analysis.
- Explain why cosine similarity is usually preferred over raw dot product in vector semantics.
- Compare n-gram, feed-forward neural, RNN, transformer, masked, and causal language models.
- Explain the difference between a causal language model and a masked language model.
Computational Questions
- Compute simple n-gram probabilities, smoothed probabilities, sentence likelihoods, and perplexity.
- Calculate tf-idf, PMI/PPMI, cosine similarity, and document centroids.
- Compute sigmoid or softmax outputs and cross-entropy loss.
- Count neural network or RNN parameters from given matrix dimensions.
- Analyze baseline vs. fine-tuned model results using evaluation tables.
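As an example of the parameter-counting question above, the sketch below counts weights and biases for a dense layer and for a simple (Elman-style) RNN with an output layer. The dimensions are hypothetical, and conventions differ on whether biases are counted, so the `bias` flag is an assumption of this sketch rather than a fixed exam rule.

```python
def dense_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Weight matrix of shape (d_out, d_in) plus an optional bias vector."""
    return d_in * d_out + (d_out if bias else 0)


def simple_rnn_params(d_in: int, d_hidden: int, d_out: int, bias: bool = True) -> int:
    """Input-to-hidden, hidden-to-hidden, and hidden-to-output parameters."""
    input_to_hidden = d_in * d_hidden
    hidden_to_hidden = d_hidden * d_hidden
    hidden_bias = d_hidden if bias else 0
    hidden_to_output = dense_params(d_hidden, d_out, bias)
    return input_to_hidden + hidden_to_hidden + hidden_bias + hidden_to_output


print(dense_params(3, 4))             # 3*4 weights + 4 biases = 16
print(simple_rnn_params(10, 20, 5))   # 200 + 400 + 20 + (100 + 5) = 725
```

The RNN count does not depend on sequence length, because the same matrices are reused at every time step.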
Term Project: Recent LLM Fine-Tuning Direction
Recent projects ask students to evaluate and fine-tune small open-source LLMs for Turkish. The project is designed to connect NLP theory to realistic model-development practice while keeping the work reproducible and comparable across groups.
Typical Objective
- Select two small open-source LLMs, usually in the range of roughly 0.3B to 4B parameters.
- Evaluate both base models on a shared Turkish public test corpus.
- Select one model for supervised fine-tuning.
- Fine-tune using a public SFT corpus, typically with LoRA or QLoRA-style parameter-efficient methods.
- Re-evaluate on the same held-out test corpus and analyze improvements, failures, and error patterns.
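To make the parameter-efficiency point in the steps above concrete, the sketch below compares the trainable parameters of a LoRA adapter with the size of the frozen weight matrix it adapts. LoRA replaces the update to a d_out × d_in matrix with two low-rank factors of shapes d_out × r and r × d_in. The dimensions and rank are hypothetical; real models adapt several matrices per layer, and the total depends on which modules are targeted.

```python
def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in one frozen weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: (rank x d_in) and (d_out x rank)."""
    return rank * (d_in + d_out)


d_model, rank = 2048, 16  # hypothetical hidden size and LoRA rank
frozen = full_matrix_params(d_model, d_model)
trainable = lora_trainable_params(d_model, d_model, rank)
print(frozen, trainable, f"{100 * trainable / frozen:.4f}%")
```

Reporting this kind of count alongside the rank and target modules makes the fine-tuning configuration easier to compare across groups.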
Recent Dataset Examples
Turkish Legal QA
Renicames/turkish-law-chatbot is a Turkish legal question-answering dataset with train/test splits. It is useful for Turkish legal NLP, legal QA, and domain adaptation experiments.
Turkish Medical QA
alibayram/doktorsitesi is a Turkish medical question-answer dataset collected from doctor-patient Q&A content. It is useful for Turkish medical NLP and domain-specific chatbot experiments. Access conditions and licensing should be checked before use.
Typical Deliverables
- Short slides explaining selected models, baseline evaluation, dataset preparation, and fine-tuning plan.
- Brief PDF report with group members, selected models, test corpus, SFT corpus, and baseline table.
- IEEE-style LaTeX report covering experiment setup, preprocessing, fine-tuning method, hyperparameters, results, error analysis, and conclusions.
- Concise technical presentation of model selection, SFT setup, before/after results, sample outputs, and error analysis.
- GitHub repository with evaluation scripts, fine-tuning scripts, inference scripts, preprocessing code, helper utilities, and README.
- Dataset documentation listing names, identifiers, sources, descriptions, licenses/access conditions, and how each dataset was used.
Project Evaluation Dimensions
- Correct baseline setup and fair model comparison.
- Proper separation of training and test data; the test set must remain unseen during training.
- Quality of instruction-data transformation and preprocessing.
- Appropriate fine-tuning configuration and hyperparameter reporting.
- Clear before/after evaluation and error analysis.
- Reproducible code, clean README, and meaningful commit history.
- Quality of academic writing and presentation.
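The requirement above that the test set remain unseen during training can be checked mechanically. The following is a minimal sketch that flags test questions whose normalized text also appears in the training data. The example questions and function names are hypothetical, and exact matching after normalization is only a first check; near-duplicates would need fuzzier matching.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    # Note: str.lower() is locale-independent and does not apply Turkish casing
    # rules (for example I -> ı), so this is a rough normalization only.
    return " ".join(text.lower().split())


def find_leaks(train_questions: list[str], test_questions: list[str]) -> list[str]:
    """Return test questions that also occur in the training set after normalization."""
    train_keys = {normalize(q) for q in train_questions}
    return [q for q in test_questions if normalize(q) in train_keys]


train = ["Kira sözleşmesi nedir?", "Miras nasıl paylaşılır?"]
test = ["kira  sözleşmesi nedir?", "Boşanma davası nasıl açılır?"]
print(find_leaks(train, test))  # ['kira  sözleşmesi nedir?']
```

Running such a check before fine-tuning, and reporting how many overlapping items were removed, supports the fair-comparison and reproducibility criteria.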
Practical Infrastructure
Students are encouraged to use modern open-source NLP infrastructure and to document their work professionally.
Experimentation
Python, PyTorch, Hugging Face Transformers/Datasets, Google Colab, and GPU-aware batch-size management.
Fine-Tuning
Unsloth, LoRA/QLoRA-style parameter-efficient fine-tuning, instruction datasets, and checkpoint management.
Evaluation
Turkish LLM benchmark tools, classification metrics, qualitative output analysis, and reproducible evaluation scripts.
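For the classification metrics mentioned above, the sketch below computes accuracy and macro-averaged F1 from gold and predicted labels and compares a base model against a fine-tuned one. The labels and predictions are invented for illustration and are not results from any course project.

```python
def accuracy(gold: list[str], pred: list[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)


# Hypothetical predictions on a four-example test set.
gold = ["olumlu", "olumlu", "olumsuz", "olumsuz"]
base_pred = ["olumlu", "olumsuz", "olumsuz", "olumsuz"]
tuned_pred = ["olumlu", "olumlu", "olumsuz", "olumsuz"]

for name, pred in [("base", base_pred), ("fine-tuned", tuned_pred)]:
    print(name, accuracy(gold, pred), round(macro_f1(gold, pred), 4))
```

Generative models often return free text rather than a clean label, so a real evaluation script would also need a documented rule for mapping model outputs to labels before computing these metrics.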
Instruction Dataset Construction
A common project task is to transform a downstream NLP dataset into an instruction-style JSON format. The structure below illustrates the expected idea.
[
  {
    "instruction": "Aşağıdaki yorumun duygu durumunun olumlu veya olumsuz olduğunu söyle.",
    "input": "Çok heyecanlı bir filmdi, çok beğendim.",
    "output": "olumlu"
  },
  {
    "instruction": "Aşağıdaki metni özetle.",
    "input": "...uzun metin...",
    "output": "...kısa özet..."
  },
  {
    "instruction": "Büyük dil modelleri neden önemlidir?",
    "input": "",
    "output": "Büyük dil modelleri, insan dilini anlama ve üretme yetenekleri nedeniyle önemlidir."
  }
]
Students should keep the transformation rule explicit, validate output quality, remove duplicates where necessary, preserve clean train/test separation, and document dataset sources.
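The transformation described above can be sketched as a small function that maps labeled examples to instruction-style records, using one explicit rule and dropping exact duplicates. The label encoding (1 for positive, 0 for negative), the sample rows, and the function name are assumptions made for illustration; a real dataset's label scheme should be taken from its documentation.

```python
import json

INSTRUCTION = "Aşağıdaki yorumun duygu durumunun olumlu veya olumsuz olduğunu söyle."
LABEL_TO_TEXT = {1: "olumlu", 0: "olumsuz"}  # hypothetical label encoding


def to_instruction_records(rows: list[tuple[str, int]]) -> list[dict[str, str]]:
    """Convert (text, label) pairs to instruction records, skipping empty and duplicate texts."""
    seen = set()
    records = []
    for text, label in rows:
        cleaned = " ".join(text.split())  # collapse whitespace before comparing
        if not cleaned or cleaned in seen:
            continue
        seen.add(cleaned)
        records.append(
            {"instruction": INSTRUCTION, "input": cleaned, "output": LABEL_TO_TEXT[label]}
        )
    return records


rows = [
    ("Çok heyecanlı bir filmdi, çok beğendim.", 1),
    ("Çok heyecanlı bir filmdi,  çok beğendim.", 1),  # duplicate up to whitespace
    ("Senaryo çok zayıftı.", 0),
]
records = to_instruction_records(rows)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps Turkish characters readable in the saved JSON. Applying the same function separately to the train and test splits keeps the transformation rule identical across both.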
Course Policies Frequently Emphasized
- Students should follow official deadlines carefully; confusing AM/PM deadline times or uploading at the last minute can create avoidable problems.
- Group members are expected to contribute actively, and GitHub commit history may be used as evidence of individual contribution.
- Attendance requirements and signature-based tracking may be enforced according to university and course policy.
- Project artifacts should be reproducible, well documented, and suitable for later academic or portfolio use.
- Model and dataset cards are encouraged when projects produce reusable models or datasets.
Who Should Take This Course?
The course is suitable for students who want to understand both the mathematical foundations of NLP and the current practice of building NLP/LLM systems. It is especially relevant for students interested in Turkish NLP, text classification, information extraction, generative AI, language model evaluation, legal AI, healthcare AI, and applied data science.