Natural Language Processing

Public summary. This page summarizes recurring course content and recent research directions. Semester-specific Classroom links, exact deadlines, student lists, private grades, and other operational details are intentionally omitted.

Course Overview

Natural Language Processing is a graduate-level course designed for students who want both theoretical depth and research-oriented practice in modern language technologies. The course starts from statistical NLP foundations and progresses toward neural language models, transformers, large language models, instruction tuning, masked language models, information retrieval, retrieval-augmented generation, and evaluation of Turkish language technologies.

The course is deliberately research-facing. Students read textbook chapters and papers, prepare presentations, implement project pipelines, evaluate models, and write final reports in an academic style. Recent offerings emphasize hallucination detection and correction in domain-specific LLMs, Turkish LLM benchmarks, Turkish downstream datasets, mixture-of-experts ideas, and rigorous evaluation of generative models.

Statistical NLP, Corpus-based methods, Collocations, N-gram models, Text classification, Word embeddings, RNNs and LSTMs, Transformers, LLMs, Instruction tuning, RAG, Turkish NLP, Benchmarks, Hallucination detection

Learning Outcomes

Foundational NLP Theory

Explain rationalist, empiricist, statistical, neural, and LLM-based approaches to language processing.
Analyze corpora using probabilistic, linguistic, and distributional assumptions.
Understand ambiguity, sparsity, Zipfian distributions, morphology, collocations, and statistical inference in NLP.

Modeling and Computation

Compute and interpret quantities such as cosine similarity, entropy, sigmoid/softmax outputs, cross-entropy, perplexity, precision, recall, F1, BLEU, ROUGE, chrF, METEOR, and embedding-based similarity scores.
Compare static embeddings, contextual embeddings, causal language models, masked language models, encoder-decoder models, and retrieval-augmented systems.
Explain the computational role of self-attention, residual connections, layer normalization, positional embeddings, and feed-forward sublayers in transformer blocks.

Research Literacy

Read, summarize, and critique current NLP and LLM research papers.
Prepare paper presentations that clearly explain the problem, method, datasets, experiments, limitations, and contribution.
Connect paper-level claims to reproducible experiments and empirical evaluation.

Project and Experiment Design

Design datasets, baselines, experiments, and evaluation protocols for Turkish NLP and LLM tasks.
Evaluate domain-specific systems in healthcare, law, information retrieval, summarization, classification, question answering, and benchmark construction.
Write an IEEE-style research report with clear problem definition, related work, methodology, results, error analysis, and limitations.

Textbooks and Current Reference Structure

Recent offerings combine classical statistical NLP, neural NLP, and modern LLM materials. Because public drafts of NLP textbooks are updated over time, this page uses current public topic names rather than relying only on old classroom chapter numbers.

Core Textbooks

Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press.
Daniel Jurafsky and James H. Martin, Speech and Language Processing, 3rd edition draft.
Jacob Eisenstein, Natural Language Processing.
Yoav Goldberg, A Primer on Neural Network Models for Natural Language Processing.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning.

Current Public SLP Topic Map

Words and tokens; edit distance; n-gram language models.
Logistic regression and text classification; embeddings; neural networks.
Large language models; transformers; post-training, instruction tuning, alignment, and test-time compute.
Masked language models; information retrieval and retrieval-augmented generation; machine translation; RNNs and LSTMs.
Sequence labeling for POS and named entities, information extraction, sentiment lexicons, coreference, and discourse.

Core Topic Map

The exact order may change by semester, but the graduate course normally covers the following themes.

1. Statistical NLP Foundations

Rationalist and empiricist traditions, probability in language, ambiguity, Zipf's Law, hapax legomena, corpus-based analysis, linguistic essentials, mathematical foundations, and the role of statistics in NLP.

2. Corpus-Based Work and Collocations

Corpus design, word classes, morphology, syntactic ambiguity, collocations, partially compositional expressions, raw-frequency limitations, POS filtering, t-test, chi-square tests, mutual information, and statistical independence.
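As a small illustration of why association measures can disagree, the sketch below computes pointwise mutual information and a t statistic for a bigram from raw counts. The function names and all counts are hypothetical, and the t statistic relies on the common approximation that the sample variance of a rare event is close to its observed probability.

```python
import math

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information (in bits) of a bigram from raw counts."""
    p_xy = c_xy / n
    p_x, p_y = c_x / n, c_y / n
    return math.log2(p_xy / (p_x * p_y))

def t_score(c_xy, c_x, c_y, n):
    """t statistic for the null hypothesis that the two words are independent.

    Assumes a Bernoulli model with s^2 ~= p, which is reasonable when p is small.
    """
    observed = c_xy / n
    expected = (c_x / n) * (c_y / n)
    return (observed - expected) / math.sqrt(observed / n)

# Hypothetical counts from a corpus of 10,000 bigram tokens.
frequent = dict(c_xy=10, c_x=100, c_y=100, n=10_000)
rare = dict(c_xy=1, c_x=1, c_y=1, n=10_000)

pmi_frequent, pmi_rare = pmi(**frequent), pmi(**rare)
t_frequent, t_rare = t_score(**frequent), t_score(**rare)
```

With these invented counts, the pair seen only once receives the higher PMI but the lower t statistic, which is one way to see why mutual information is sensitive to low-frequency events.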

3. N-gram Language Models

Chain rule, Markov assumption, unigram, bigram, and trigram models, maximum-likelihood estimation, sparse-data problems, smoothing, Good-Turing estimation, back-off, interpolation, deleted interpolation, perplexity, and language-model evaluation.
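The following is a minimal sketch of a bigram language model with add-k smoothing and sentence-level perplexity. The toy corpus, the `<s>` and `</s>` boundary symbols, and all function names are assumptions made for illustration. Add-one smoothing is used only because it is the simplest fix for unseen bigrams, not because it is the preferred method.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect context counts, bigram counts, and the prediction vocabulary."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs (Markov assumption)
        vocab.update(tokens[1:-1])
    return contexts, bigrams, vocab

def bigram_prob(prev, word, model, k=1.0):
    """Add-k estimate of P(word | prev); k=1 is Laplace smoothing."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + k) / (contexts[prev] + k * len(vocab))

def perplexity(sentence, model, k=1.0):
    """Perplexity of one sentence, counting </s> as a predicted token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = sum(math.log(bigram_prob(p, w, model, k)) for p, w in pairs)
    return math.exp(-log_prob / len(pairs))

corpus = ["ben eve gittim", "ben okula gittim"]  # toy Turkish corpus
model = train_bigram(corpus)
seen = perplexity("ben eve gittim", model)
shuffled = perplexity("gittim eve ben", model)
```

The training sentence gets a lower perplexity than the shuffled word order. With k set to 0 the estimate reduces to maximum likelihood, and the shuffled sentence would receive zero probability because its bigrams were never observed.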

4. Traditional Supervised NLP

Text categorization, Naive Bayes, logistic regression, sentiment analysis, feature representation, class imbalance, regularization, evaluation metrics, and error analysis.
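A multinomial Naive Bayes classifier with Laplace smoothing can be sketched in a few lines. The training examples, labels, and function names below are invented for illustration, and words outside the training vocabulary are simply skipped at prediction time, which is one common convention rather than the only option.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Count documents per class and word occurrences per class."""
    class_docs, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in examples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(text, model, alpha=1.0):
    """Return the class maximizing log P(c) + sum of log P(w | c)."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        score = math.log(class_docs[label] / n_docs)      # log prior
        total = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue                                   # ignore unknown words
            count = word_counts[label][token]
            score += math.log((count + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

examples = [
    ("great film loved it", "pos"),
    ("wonderful acting great story", "pos"),
    ("boring film hated it", "neg"),
    ("terrible acting boring story", "neg"),
]
model = train_nb(examples)
prediction = predict_nb("great story", model)
```

Without the `alpha` term, a word such as "great" that never appears in a negative training document would drive the negative class probability to zero regardless of the other evidence.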

5. Vector Semantics and Embeddings

Distributional semantics, tf-idf, PMI, PPMI, cosine similarity, sparse versus dense vectors, word2vec, negative sampling, subword embeddings, static embeddings, contextual embeddings, and downstream evaluation.
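To make the sparse-vector side of this topic concrete, here is a minimal sketch that builds tf-idf vectors and compares documents with cosine similarity. It assumes one particular weighting, tf = 1 + log10(count) and idf = log10(N / df), along with whitespace tokenization; other variants are in use, and the example documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dictionary per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            t: (1 + math.log10(c)) * math.log10(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stocks fell on monday",
]
vecs = tfidf_vectors(docs)
sim_cats = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

The two cat sentences come out more similar than the cat sentence and the stock-market sentence, even though the latter pair also shares a word.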

6. Neural Language Models and Sequence Processing

Feed-forward neural networks, neural language models, softmax, cross-entropy, gradient descent, RNNs, backpropagation through time, vanishing gradients, LSTMs, bidirectional models, sequence labeling, sequence classification, encoder-decoder architectures, and attention.
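The recurrence at the heart of a simple RNN can be written as h_t = tanh(W_x x_t + W_h h_{t-1} + b). The sketch below implements that forward pass in plain Python. The weight values and dimensions are arbitrary illustrative choices, and no training step is shown.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rnn_forward(inputs, W_x, W_h, b, h0=None):
    """Run a simple (Elman-style) RNN and return the hidden state at each step."""
    h = h0 or [0.0] * len(b)
    states = []
    for x in inputs:
        pre = [a + c + d for a, c, d in zip(matvec(W_x, x), matvec(W_h, h), b)]
        h = [math.tanh(p) for p in pre]   # new hidden state depends on x and old h
        states.append(h)
    return states

# Hypothetical weights: input size 1, hidden size 2.
W_x = [[0.5], [-0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
states = rnn_forward([[1.0], [0.0]], W_x, W_h, b)
```

The second input is zero, yet the second hidden state is nonzero because it still carries information from the first step. The same repeated multiplication by `W_h` is what makes gradients shrink over long sequences.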

7. Transformers and Pretrained Language Models

Self-attention, queries, keys, values, multi-head attention, positional embeddings, residual connections, layer normalization, feed-forward networks, masked language modeling, causal language modeling, transfer learning, pretraining, and fine-tuning.
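The core computation of self-attention, softmax(QK^T / sqrt(d_k)) V, can be sketched without any library. For simplicity the example below uses the same toy vectors as queries, keys, and values and omits the learned projection matrices and multiple heads of a real transformer. The optional `causal` flag is an illustrative way to show the masking used in causal language modeling.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights); weights[i][j] is how much position i attends to j.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:  # a causal LM may not look at later positions
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        row = softmax(scores)   # normalizes over the key positions
        weights.append(row)
        outputs.append([sum(w * v[c] for w, v in zip(row, V))
                        for c in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
out, attn = attention(X, X, X)
_, causal_attn = attention(X, X, X, causal=True)
```

Each row of `attn` sums to one, and with `causal=True` the first position can attend only to itself.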

8. Large Language Models and Post-Training

Autoregressive generation, prompting, in-context learning, instruction tuning, alignment, preference optimization, test-time compute, reasoning-oriented models, mixture-of-experts architectures, parameter-efficient fine-tuning, and task adaptation.
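One simple route to task adaptation through instruction tuning is to recast an existing classification dataset as instruction-style records. The sketch below shows one possible conversion. The `instruction`/`input`/`output` field names follow a widely used convention but are an assumption here, as are the prompt wording, the labels, and the example sentences.

```python
import json

def to_instruction_record(text, label, labels,
                          task="Classify the sentiment of the review."):
    """Turn one (text, label) pair into an instruction-style training record."""
    if label not in labels:
        raise ValueError(f"unknown label: {label!r}")
    options = ", ".join(labels)
    return {
        "instruction": f"{task} Answer with one of: {options}.",
        "input": text,
        "output": label,
    }

labels = ["olumlu", "olumsuz"]  # hypothetical Turkish sentiment labels
dataset = [
    ("Film harikaydı.", "olumlu"),
    ("Hizmet çok yavaştı.", "olumsuz"),
]
records = [to_instruction_record(t, y, labels) for t, y in dataset]

# One JSON object per line (JSONL); ensure_ascii=False keeps Turkish characters.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

In practice several paraphrased instruction templates are often used per task so that a model does not overfit to a single prompt wording.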

9. Information Retrieval and RAG

Dense retrieval, sparse retrieval, hybrid retrieval, reranking, retrieval-augmented generation, faithfulness, grounded answer generation, hallucination risk, answer verification, and domain-specific RAG for legal and healthcare applications.
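Hybrid retrieval needs some way to merge a sparse ranking with a dense ranking. Reciprocal rank fusion is one common choice, sketched below. It is shown only as an example and is not necessarily the method used in the course. The document identifiers and rankings are invented, and `k=60` is a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

sparse_ranking = ["d3", "d1", "d2"]   # e.g. from a keyword-based retriever
dense_ranking = ["d1", "d4", "d3"]    # e.g. from an embedding-based retriever
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Documents that appear near the top of both lists rise in the fused ranking. A reranker can then rescore the top fused candidates before they are passed to the generator.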

10. Turkish NLP and Evaluation

Turkish morphology, tokenization issues, low-resource challenges, Turkish downstream tasks, Turkish LLM benchmarks, culturally grounded evaluation, domain-specific benchmark construction, fairness, validity, and reproducibility.
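The effect of Turkish morphology on overlap-based evaluation can be seen in a small example. The sketch below compares a word-level F1 with a character-trigram F1 for two phrases that differ by one suffix. The character measure is a deliberately simplified stand-in and is not the official chrF definition, which aggregates several n-gram orders. The sentences and function names are illustrative.

```python
from collections import Counter

def f1_overlap(reference_items, hypothesis_items):
    """F1 over the multiset overlap between reference and hypothesis items."""
    ref, hyp = Counter(reference_items), Counter(hypothesis_items)
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def char_ngrams(text, n=3):
    """Character n-grams of the text with spaces removed."""
    s = text.replace(" ", "")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

reference = "evlerimizden geldik"   # "we came from our houses"
hypothesis = "evimizden geldik"     # "we came from our house"

word_f1 = f1_overlap(reference.split(), hypothesis.split())
char_f1 = f1_overlap(char_ngrams(reference), char_ngrams(hypothesis))
```

The word-level score treats the two inflected forms as entirely different tokens, while the character-level score gives partial credit for the shared stem and suffixes.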

Assessment Pattern

Assessment varies by semester. A recent graduate offering used a combination of quizzes, paper presentations, project work, a midterm exam, and a final exam. Exact percentages are always announced on the official course platform.

15%: Quizzes in total, often distributed across multiple short exams.
10%: Paper presentations, normally focused on recent NLP or LLM research.
20%: Term project, including datasets, code, report, and presentation-quality explanation.
55%: Midterm and final exams, including conceptual and calculation-oriented questions.

Exam preparation normally requires students to understand formulas and calculations rather than only memorize definitions. Students are expected to be able to compute or interpret cosine similarity, sigmoid, softmax, entropy, derivatives, cross-entropy, perplexity, and evaluation metrics.
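The kind of calculation involved can be practiced with a few short functions. The sketch below works through a sigmoid output, its cross-entropy loss, a softmax, and the entropy and perplexity of a uniform distribution. All weights, inputs, and scores are made-up practice values, not taken from any exam.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def cross_entropy(p_gold):
    """Negative natural-log probability assigned to the gold label."""
    return -math.log(p_gold)

# Binary logistic regression with hypothetical weights and features.
w, x, b = [0.5, -1.0], [2.0, 1.0], 0.1
z = sum(wi * xi for wi, xi in zip(w, x)) + b   # 1.0 - 1.0 + 0.1 = 0.1
p_pos = sigmoid(z)
loss_if_gold_pos = cross_entropy(p_pos)
loss_if_gold_neg = cross_entropy(1.0 - p_pos)

# Softmax over three class scores.
probs = softmax([2.0, 1.0, 0.0])

# A uniform distribution over four outcomes has 2 bits of entropy,
# which corresponds to a perplexity of 2 ** 2 = 4.
h = entropy_bits([0.25] * 4)
ppl = 2 ** h
```

Because `p_pos` is slightly above 0.5, the loss is smaller when the gold label is positive than when it is negative.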

Paper Presentations

Graduate students are expected to read and present research papers. A strong paper presentation should not merely summarize an abstract; it should explain the research problem, the gap in the literature, the model or method, datasets, experimental design, metrics, baselines, main results, limitations, and possible extensions.

Typical Paper Themes

Hallucination detection and correction in LLMs.
Turkish LLM evaluation, benchmarks, and dataset construction.
RAG, retrieval verification, and domain-grounded generation.
Instruction tuning, alignment, and test-time computation.
Mixture of Experts, sparse expert models, and expert routing.
Transformers, masked language models, and contextual representations.

Recent Term Project Directions

Term projects are research-style and may evolve each year. Recent topics have centered on Turkish LLM evaluation, hallucination detection, domain-specific RAG, and advanced fine-tuning.

Hallucination Detection and Correction in LLMs

Students choose a domain such as healthcare or justice, survey hallucination taxonomies and detection methods, build or adapt a dataset of LLM outputs, label hallucinated versus non-hallucinated responses, implement detection approaches, and propose correction strategies such as retrieval-augmented verification, human-in-the-loop editing, explanation feedback, or domain-specific rules.

Possible methods: uncertainty signals, entailment-based verification, retrieval-based evidence checking, classifier-based detectors, claim decomposition, and cautious use of LLM-as-judge analysis.
Evaluation: detection accuracy, false positives, false negatives, domain impact, correction success, evidence coverage, and qualitative error analysis.
Domain-specific issues: legal risk, medical safety, source authority, data privacy, explainability, and regulatory constraints.
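The detection metrics listed above can be computed from binary labels as in the following sketch, which treats 1 as "hallucinated" and 0 as "not hallucinated". The gold and predicted labels are invented, and the function name is hypothetical.

```python
def detection_report(gold, predicted):
    """Confusion counts and metrics for binary hallucination detection."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)   # false alarms
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)   # missed hallucinations
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall, "f1": f1,
        "accuracy": (tp + tn) / len(pairs),
    }

gold = [1, 1, 1, 0, 0, 0, 0, 1]        # hypothetical human labels
predicted = [1, 0, 1, 0, 1, 0, 0, 1]   # hypothetical detector outputs
report = detection_report(gold, predicted)
```

In healthcare or legal settings a missed hallucination (false negative) may be costlier than a false alarm, so the two error counts are worth reporting separately rather than only through accuracy or F1.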

Modern LLM Benchmarks for Turkish

Students study existing Turkish LLM benchmarks, identify shortcomings, define criteria for improved evaluation, collect or construct new benchmark items, implement an evaluation harness, test selected open models, and analyze where models fail.

Benchmark dimensions: task diversity, Turkish morphology, idioms, cultural context, legal or medical domains, generative tasks, discriminative tasks, few-shot and zero-shot settings, and fairness.
Outputs: dataset/test suite, evaluation code, baseline results, failure taxonomy, and a research report proposing improvements.
Recent public reference points: TurkishMMLU/TR-MMLU, METU Turkish LLM Benchmark, Mukayese/TurkBench-style resources, OpenLLM Turkish Leaderboard, Cetvel, and other Turkish evaluation efforts.

Mixture of Experts and Turkish Downstream Tasks

Earlier project directions asked students to investigate Mixture of Experts ideas, fine-tune task-specific models for Turkish downstream datasets, test each expert on its own task and cross-task settings, and compare individual expert models with an MoE-style or routed system.

Example tasks: sentiment classification, summarization, question answering, natural language inference, named entity recognition, keyword extraction, and text categorization.
Example metrics: ROUGE, BLEU, METEOR, BERTScore, accuracy, F1, and task-specific metrics.
Expected report quality: academic paper style, reproducible experiments, clear model/data cards, and careful discussion of limitations.

Classical-to-Neural Domain NLP Projects

Earlier course projects included domain and dataset selection, collocation and n-gram analysis, domain-specific word embedding training or fine-tuning, keyword extraction, topic detection, simple knowledge-graph construction, text classification, summarization, and zero-shot keyword classification.

Suggested Project Pipeline

1. Problem and Dataset Definition

Define the research question, application domain, dataset sources, licensing constraints, expected outputs, and evaluation protocol. Include a data statement explaining how the dataset was collected or transformed.

2. Literature and Baselines

Review current papers and benchmark resources. Implement simple baselines before introducing large models or complex architectures.

3. Model Development

Implement preprocessing, prompt templates, retrieval modules, fine-tuning scripts, evaluation scripts, and reproducible notebooks or command-line pipelines.

4. Evaluation and Error Analysis

Report quantitative metrics, significance or robustness checks where appropriate, representative examples, error categories, limitations, and practical implications.

5. Research Report

Write an IEEE-style report with abstract, introduction, related work, methodology, experimental setup, results, discussion, conclusion, and references.

6. Reproducible Release Package

Prepare code, datasets or dataset links, README, environment information, model cards, dataset cards, and clear instructions for regenerating results.

Useful Public Resources

Speech and Language Processing, 3rd edition draft
Foundations of Statistical Natural Language Processing
METU Turkish LLM Benchmark
TR-MMLU / Turkish MMLU paper
TurkBench: Turkish LLM benchmark
OpenLLM Turkish Leaderboard
BERTurk / dbmdz Turkish BERT
LLM transformer visualization
Hugging Face PEFT
Sparse expert models survey

Study Questions

The questions below combine recurring quiz/exam themes, textbook reading expectations, and project-oriented reasoning. They are intended as a public study guide rather than a promise of exact exam content.

Statistical NLP, Corpora, and Linguistic Foundations

Compare rationalist and empiricist approaches to language. How do these traditions influence rule-based, statistical, and neural NLP systems?
What does Zipf's Law say about word frequency distributions, and why does it create data sparsity problems?
Define hapax legomena. Why are rare words important in language modeling and corpus analysis?
Explain the difference between open and closed word classes. Give examples in English and Turkish.
Why is morphology especially important for Turkish NLP?
Give examples of lexical ambiguity, structural ambiguity, attachment ambiguity, and non-local dependencies.
What is a corpus? What factors make a corpus useful or misleading for model training and evaluation?
Explain why a purely categorical view of grammaticality is often insufficient for statistical NLP.
How do probabilistic models help address linguistic ambiguity?
What are the limitations of relying only on manually written rules for NLP?

Collocations and N-gram Language Models

Define collocation. How is a collocation different from an idiom?
Why is raw frequency alone insufficient for identifying meaningful collocations?
Explain how part-of-speech filtering can improve collocation extraction.
Compare t-test, chi-square test, and mutual information for collocation discovery.
Why is mutual information sensitive to low-frequency events?
State the chain rule of probability and show how it is approximated in n-gram language modeling.
What is the Markov assumption, and why is it useful?
Compute a simple bigram sentence probability from given unigram and bigram counts.
What is maximum-likelihood estimation? Why does it fail for unseen n-grams?
Compare Laplace smoothing, Good-Turing estimation, back-off, and interpolation.
Define perplexity. Why is lower perplexity generally preferred?
Why does the number of parameters grow rapidly as n increases in n-gram models?

Text Classification, Logistic Regression, and Embeddings

Explain the bag-of-words assumption and its limitations.
Derive the Naive Bayes decision rule for text classification. Which independence assumption does it make?
Why is Laplace smoothing needed in Naive Bayes text classification?
How does logistic regression differ from Naive Bayes conceptually?
Given a small feature vector and weights, compute the sigmoid output of a binary logistic regression model.
Compute cross-entropy loss for a predicted probability and a gold label.
What is L2 regularization, and how does it affect model parameters?
Define the distributional hypothesis.
Compare static embeddings and contextual embeddings. Give one task where each may be useful.
Define tf, idf, tf-idf, PMI, PPMI, and cosine similarity.
How does skip-gram with negative sampling avoid constructing a full co-occurrence matrix?
Why are subword embeddings useful for morphologically rich languages?

Neural Networks, RNNs, LSTMs, and Encoder-Decoder Models

What is the main limitation of feed-forward neural networks for language sequences?
Write the basic RNN recurrence equation and explain the role of the hidden state.
What is backpropagation through time, and why is it required for RNNs?
Explain the vanishing-gradient problem and its practical effect on long-sequence modeling.
What are the forget, input, and output gates in an LSTM? What does each control?
Compare sequence labeling and sequence classification. Give an NLP example of each.
What is bidirectionality in recurrent networks, and when is it useful?
Explain encoder-decoder architectures for sequence generation.
What is the bottleneck problem in basic encoder-decoder models?
How does attention help solve the fixed-context-vector bottleneck?
What is autoregressive generation?
Compare teacher forcing with free-running generation.

Transformers, Masked Language Models, and LLMs

What is the key innovation of the transformer architecture?
Explain queries, keys, and values in self-attention.
How is dot-product attention computed? What does the softmax normalize over?
Why do transformers need positional information?
What is a residual connection, and why does it help training deep networks?
What is layer normalization, and where does it appear in a transformer block?
Compare causal language modeling and masked language modeling.
Describe the masked language modeling objective used in BERT-like models.
What is transfer learning in the pretrain-fine-tune paradigm?
Explain the difference between pretraining, continued pretraining, supervised fine-tuning, instruction tuning, and alignment.
What is in-context learning? How is it different from parameter updates?
What are common reasons LLMs hallucinate?
How can retrieval-augmented generation reduce hallucination risk?
What are the limitations of LLM-as-judge evaluation?
Explain the intuition behind mixture-of-experts models and sparse expert routing.

Evaluation, RAG, Turkish Benchmarks, and Research Projects

Compare BLEU, ROUGE, METEOR, chrF, BERTScore, accuracy, F1, and human evaluation. Which tasks are they suitable for?
Why can word-overlap metrics be problematic for morphologically rich languages such as Turkish?
What does a good benchmark measure besides raw accuracy?
What are common shortcomings of translated benchmarks?
What properties should a better Turkish LLM benchmark include?
How would you design a benchmark for Turkish legal question answering?
How would you evaluate hallucination detection in healthcare or justice domains?
Define evidence coverage, faithfulness, answer correctness, and citation quality in a RAG system.
Explain the difference between sparse retrieval, dense retrieval, and hybrid retrieval.
What is reranking, and why is it useful in RAG?
How would you construct an instruction-style dataset from a traditional classification dataset?
What should be included in an IEEE-style NLP project report?
What should a reproducible NLP project repository contain?
What ethical and safety issues arise when building LLM systems for healthcare and legal domains?