Introduction to Natural Language Processing

Public summary. This page summarizes the recurring structure and recent direction of the course. Classroom links, dates, rooms, office hours, exact deadlines, and grading details may change by semester and are maintained in the official learning management system.

Course Overview

Introduction to Natural Language Processing introduces students to computational methods for representing, analyzing, and generating human language. The course starts with classical statistical NLP and gradually moves toward modern neural methods, transformers, and large language models. It places recurring emphasis on Turkish NLP, low-resource language challenges, reproducible experimentation, and practical model evaluation.

In recent offerings, the project component has evolved toward evaluating and fine-tuning small open-source LLMs for Turkish tasks. Students work with public datasets, convert task data into instruction-style formats, run baseline evaluations, apply supervised fine-tuning, and analyze whether the adapted model improves over the base model.

Topics: Statistical NLP, Text classification, Embeddings, Neural language models, Transformers, LLMs, Turkish NLP, SFT / LoRA / QLoRA

Learning Outcomes

Conceptual Understanding

Explain the main ideas behind statistical NLP, corpus-based analysis, language modeling, embeddings, and neural sequence models.
Compare rule-based, statistical, and neural approaches to language processing.
Understand why Turkish poses distinctive modeling challenges arising from its rich agglutinative morphology, data sparsity, and limited resource availability.

Technical Skills

Prepare text data, tokenize it, represent it numerically, and train or evaluate NLP models.
Compute and interpret metrics such as accuracy, F1, cross-entropy, perplexity, and baseline-vs-fine-tuned performance.
Use reproducible workflows with Python, GitHub, Hugging Face, Google Colab, and modern fine-tuning tools.

Core Topic Map

The exact order may vary, but the course normally covers the following themes.

1. Introduction and Linguistic Foundations

What NLP is, rationalist vs. empiricist traditions, ambiguity, probability in language, Zipf's Law, corpus-based thinking, morphology, parts of speech, syntactic ambiguity, semantic roles, and basic linguistic terminology.

2. Text Processing and Corpus-Based Work

Regular expressions, tokenization, sentence segmentation, stemming, lemmatization, edit distance, spelling correction, corpus design, word classes, and practical preprocessing issues.
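Edit distance, one of the topics above, is usually computed with dynamic programming. The following is a minimal sketch, assuming unit cost for insertion, deletion, and substitution (some textbook variants charge 2 for substitution); the function name edit_distance is illustrative and not taken from course material.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters turns the prefix into ""
        for j, t_char in enumerate(target, start=1):
            substitute = previous[j - 1] + (s_char != t_char)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("kitap", "kitab"))     # 1
```

Keeping only two rows of the table is enough when just the distance is needed; recovering the alignment itself would require the full table or backpointers.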

3. Collocations and Statistical Association

Collocations, idioms, frequency-based extraction, part-of-speech filtering, hypothesis testing, t-test, chi-square test, mutual information, and limitations of low-frequency evidence.
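Two of the association measures above can be sketched from raw counts. This is a hedged illustration, assuming bigram and unigram counts drawn from the same corpus of `total` tokens, and using the usual small-probability approximation in which the sample variance is taken to be the observed bigram probability; the function names and the counts in the usage lines are invented for illustration.

```python
import math


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information (base 2) of a word pair."""
    p_pair = pair_count / total
    p_independent = (w1_count / total) * (w2_count / total)
    return math.log2(p_pair / p_independent)


def t_score(pair_count, w1_count, w2_count, total):
    """t statistic against the null hypothesis that the two words are independent."""
    observed = pair_count / total
    expected = (w1_count / total) * (w2_count / total)
    # Variance p(1 - p) is approximated by p because bigram probabilities are small.
    return (observed - expected) / math.sqrt(observed / total)


# A pair seen 10 times, each word seen 100 times, in 10,000 tokens.
print(round(pmi(10, 100, 100, 10_000), 3))      # 3.322
print(round(t_score(10, 100, 100, 10_000), 3))  # 2.846

# A pair of words that each occur exactly once gets a very high PMI,
# which illustrates why low-frequency evidence needs care.
print(round(pmi(1, 1, 1, 10_000), 2))           # 13.29
```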

4. N-gram Language Models

Chain rule, Markov assumption, unigram/bigram/trigram models, maximum likelihood estimation, sparse data, smoothing, back-off, interpolation, deleted interpolation, perplexity, and language model evaluation.
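The pieces above (Markov assumption, maximum likelihood counts, smoothing, perplexity) fit together in a few lines. The sketch below assumes whitespace tokenization, sentence padding with `<s>` and `</s>`, and add-k smoothing with k = 1 by default; the two-sentence corpus and all function names are illustrative only.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts, padding sentences with <s> and </s>."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])             # tokens that precede another token
        bigram_counts.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])                       # predictable tokens, including </s>
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, word, model, k=1.0):
    """Add-k smoothed estimate of P(word | prev); k must be positive for unseen contexts."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * len(vocab))


def perplexity(sentence, model, k=1.0):
    """Perplexity of one sentence, counting </s> but not <s> as a predicted token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log(bigram_prob(prev, word, model, k))
        for prev, word in zip(tokens, tokens[1:])
    )
    return math.exp(-log_prob / (len(tokens) - 1))


model = train_bigram(["ben geldim", "ben gittim"])
print(bigram_prob("<s>", "ben", model))  # 0.5, i.e. (2 + 1) / (2 + 4)
print(perplexity("ben geldim", model))   # cube root of 15, about 2.466
```

Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow on longer sentences.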

5. Text Categorization and Classical Machine Learning

Naive Bayes, logistic regression, sentiment analysis, text classification pipelines, bag-of-words representations, feature weighting, regularization, and evaluation of classification models.
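As a rough illustration of the bag-of-words pipeline above, the following sketch trains a multinomial Naive Bayes classifier with add-one smoothing. It assumes whitespace tokenization and silently ignores words outside the training vocabulary; the four toy reviews and the function names are invented for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs is a list of (text, label) pairs; returns log priors, log likelihoods, vocab."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_like = {}
    for c in label_counts:
        # Add-one smoothing: every vocabulary word gets one extra count per class.
        denominator = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / denominator) for w in vocab}
    return log_prior, log_like, vocab


def predict_nb(text, log_prior, log_like, vocab):
    """Return the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in text.split() if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


train = [
    ("çok güzel film", "olumlu"),
    ("güzel oyunculuk", "olumlu"),
    ("çok kötü film", "olumsuz"),
    ("kötü senaryo", "olumsuz"),
]
params = train_nb(train)
print(predict_nb("güzel film", *params))  # olumlu
print(predict_nb("kötü film", *params))   # olumsuz
```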

6. Vector Semantics and Embeddings

Distributional hypothesis, sparse and dense vectors, tf-idf, PMI, PPMI, cosine similarity, document centroids, word2vec, skip-gram with negative sampling, fastText, and intrinsic/extrinsic evaluation.
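Several of the quantities above are simple enough to compute by hand, and a short sketch can serve as a check. The tf-idf form used here, log10(count + 1) times log10(N / df), is one common variant and is an assumption; other weightings exist. Function names are illustrative.

```python
import math


def tf_idf(term_count, n_docs, doc_freq):
    """Log-scaled term frequency times log inverse document frequency (one common variant)."""
    tf = math.log10(term_count + 1)
    idf = math.log10(n_docs / doc_freq)
    return tf * idf


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction, ten times the length
print(sum(a * b for a, b in zip(short_doc, long_doc)))  # raw dot product: 50.0
print(round(cosine(short_doc, long_doc), 6))            # 1.0
print(centroid([[1.0, 2.0], [3.0, 4.0]]))               # [2.0, 3.0]
```

The two documents point in the same direction, so cosine treats them as identical even though the raw dot product grows with vector length; this is the usual argument for length normalization in vector semantics.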

7. Neural Language Models

Feed-forward neural networks, activations, softmax, gradients, cross-entropy, multi-layer networks, neural language models, RNNs, LSTMs, bidirectionality, attention, encoder-decoder models, and sequence labeling vs. sequence classification.
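The output-layer computations above (sigmoid, softmax, cross-entropy, and the gradient with respect to the logits) can be sketched in plain Python. This assumes a single example with one gold class, and the logit values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Softmax with the maximum subtracted first for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, gold_index):
    """Negative log probability assigned to the gold class."""
    return -math.log(probs[gold_index])


def softmax_ce_gradient(probs, gold_index):
    """Gradient of the cross-entropy loss w.r.t. the logits: probs minus one-hot gold."""
    return [p - (1.0 if i == gold_index else 0.0) for i, p in enumerate(probs)]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
print(round(cross_entropy(probs, 0), 3))
print([round(g, 3) for g in softmax_ce_gradient(probs, 0)])
```

The gradient entries always sum to zero, which is a quick sanity check for hand calculations.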

8. Transformers and Large Language Models

Self-attention, masked language modeling, causal language modeling, transformer architecture, pretraining, prompting, in-context learning, instruction tuning, alignment, and the role of retrieval and evaluation in modern LLM systems.
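The difference between causal and bidirectional attention comes down to which positions each token may look at. The sketch below is a deliberately simplified single-head scaled dot-product attention with a causal restriction; it assumes the queries, keys, and values are given directly, with no learned projection matrices, multiple heads, or positional information. All names are illustrative.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention where position i sees only positions 0..i."""
    d_k = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        # Scores are computed only for the current and earlier positions (causal mask).
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs


x = [[1.0, 0.0], [0.0, 1.0]]
out = causal_self_attention(x, x, x)
print(out[0])  # [1.0, 0.0]: the first position can attend only to itself
print(out[1])  # a mixture of both positions, weighted toward the second
```

Letting `j` range over every position instead of `range(i + 1)` would give the bidirectional attention used by masked language models.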

9. Turkish NLP and Applied LLM Projects

Turkish datasets, instruction-data construction, supervised fine-tuning, parameter-efficient fine-tuning, model comparison, benchmark evaluation, error analysis, reproducibility, model cards, and dataset cards.

Recent Reading Structure

Recent offerings have combined classical statistical NLP readings with the latest public draft of Speech and Language Processing. Older classroom announcements used the 2025 chapter numbering; the public 2026 draft reorganizes the LLM-related chapters, so this public page uses topic names rather than depending only on chapter numbers.

Classical Statistical NLP

Manning & Schütze, Foundations of Statistical Natural Language Processing: Introduction.
Linguistic essentials and corpus-based work.
Collocations and statistical association tests.
N-gram models, sparse data, smoothing, back-off, and interpolation.

Modern NLP and LLMs

Jurafsky & Martin, Speech and Language Processing: words and tokens, n-gram models, text classification, embeddings, neural networks, LLMs, transformers, post-training, masked language models, RAG, machine translation, and RNNs/LSTMs.
Selected papers and technical notes on word2vec, fastText-style efficient text classification, Turkish LLM evaluation, and parameter-efficient fine-tuning.

Assessment and Course Work

Assessment varies by semester. A recent grading pattern used quizzes, a midterm exam, a term project, and a final exam. One recent example was: quizzes 15%, midterm exam 30%, project 15%, and final exam 40%.

Quizzes (15%): Reading-based checks and short technical questions.
Midterm (30%): Statistical NLP foundations, classification, embeddings, and calculations.
Project (15%): Team-based NLP or LLM experimentation with reproducible deliverables.
Final (40%): Broader coverage including neural models, transformers, LLMs, and project-related concepts.

This breakdown is shown to help readers understand the course style; the official grading scheme for a given semester is announced separately.

Exam Preparation Themes

Students are expected to understand concepts and perform manageable hand calculations. A scientific calculator is usually helpful for computing sigmoid and softmax values, gradients, log probabilities, and cross-entropy.

Conceptual Questions

Explain why sparse data is a major problem in n-gram language modeling.
Compare rule-based tokenization, BPE-style subword tokenization, and morphological analysis.
Explain why cosine similarity is usually preferred over raw dot product in vector semantics.
Compare n-gram, feed-forward neural, RNN, transformer, masked, and causal language models.
Explain the difference between a causal language model and a masked language model.

Computational Questions

Compute simple n-gram probabilities, smoothed probabilities, sentence likelihoods, and perplexity.
Calculate tf-idf, PMI/PPMI, cosine similarity, and document centroids.
Compute sigmoid or softmax outputs and cross-entropy loss.
Count neural network or RNN parameters from given matrix dimensions.
Analyze baseline vs. fine-tuned model results using evaluation tables.
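Parameter-counting questions of the kind listed above reduce to summing matrix sizes. The sketch below assumes a fully connected feed-forward network and a simple (Elman-style) RNN with one input matrix, one recurrent matrix, and one output matrix; whether biases are counted depends on the formulation used, so it is exposed as a flag. Function names and dimensions are illustrative.

```python
def feedforward_params(layer_sizes, bias=True):
    """Total weights (and optional biases) of a fully connected network."""
    return sum(
        n_in * n_out + (n_out if bias else 0)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def simple_rnn_params(d_in, d_hidden, d_out, bias=True):
    """Parameters of a simple RNN: input, recurrent, and output matrices."""
    input_to_hidden = d_in * d_hidden
    hidden_to_hidden = d_hidden * d_hidden
    hidden_to_output = d_hidden * d_out
    biases = (d_hidden + d_out) if bias else 0
    return input_to_hidden + hidden_to_hidden + hidden_to_output + biases


print(feedforward_params([3, 4, 2]))            # (3*4 + 4) + (4*2 + 2) = 26
print(simple_rnn_params(3, 4, 2))               # 12 + 16 + 8 + 6 = 42
print(simple_rnn_params(3, 4, 2, bias=False))   # 12 + 16 + 8 = 36
```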

Term Project: Recent LLM Fine-Tuning Direction

Recent projects ask students to evaluate and fine-tune small open-source LLMs for Turkish. The project is designed to connect NLP theory to realistic model-development practice while keeping the work reproducible and comparable across groups.

Typical Objective

Select two small open-source LLMs, usually in the approximate 0.3B to 4B parameter range.
Evaluate both base models on a shared Turkish public test corpus.
Select one model for supervised fine-tuning.
Fine-tune using a public SFT corpus, typically with LoRA or QLoRA-style parameter-efficient methods.
Re-evaluate on the same held-out test corpus and analyze improvements, failures, and error patterns.
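The idea behind LoRA-style fine-tuning in the steps above can be shown without any training library. The following is a conceptual sketch, not the Unsloth or Hugging Face API: a frozen weight matrix W receives a low-rank update scaled by alpha / r, with A of shape r × d_in and B of shape d_out × r. The matrix sizes and values are invented for illustration.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Number of trainable values in the two low-rank factors A and B."""
    return rank * (d_in + d_out)


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B A, where r is the number of rows of A."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# A 4096 x 4096 projection has about 16.8M weights; a rank-8 adapter trains far fewer.
print(4096 * 4096)                            # 16777216
print(lora_trainable_params(4096, 4096, 8))   # 65536

w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]                 # rank 1, shape 1 x 2
b_zero = [[0.0], [0.0]]          # B starts at zero, so the adapted model equals the base
print(lora_effective_weight(w, a, b_zero, alpha=2.0))            # [[1.0, 0.0], [0.0, 1.0]]
print(lora_effective_weight(w, a, [[1.0], [2.0]], alpha=2.0))    # [[3.0, 2.0], [4.0, 5.0]]
```

Because B starts at zero, the fine-tuned model initially reproduces the base model exactly, which is one reason a fair before/after comparison on the same test set is straightforward.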

Recent Dataset Examples

Turkish Legal QA

Renicames/turkish-law-chatbot is a Turkish legal question-answering dataset with train/test splits. It is useful for Turkish legal NLP, legal QA, and domain adaptation experiments.

Turkish Medical QA

alibayram/doktorsitesi is a Turkish medical question-answer dataset collected from doctor-patient Q&A content. It is useful for Turkish medical NLP and domain-specific chatbot experiments. Access conditions and licensing should be checked before use.

Typical Deliverables

Progress presentation
Short slides explaining selected models, baseline evaluation, dataset preparation, and fine-tuning plan.

Progress summary
Brief PDF report with group members, selected models, test corpus, SFT corpus, and baseline table.

Final report
IEEE-style LaTeX report covering experiment setup, preprocessing, fine-tuning method, hyperparameters, results, error analysis, and conclusions.

Final presentation
Concise technical presentation of model selection, SFT setup, before/after results, sample outputs, and error analysis.

Code repository
GitHub repository with evaluation scripts, fine-tuning scripts, inference scripts, preprocessing code, helper utilities, and README.

Dataset links document
Dataset names, identifiers, sources, descriptions, licenses/access conditions, and how each dataset was used.

Project Evaluation Dimensions

Correct baseline setup and fair model comparison.
Proper separation of training and test data; the test set must remain unseen during training.
Quality of instruction-data transformation and preprocessing.
Appropriate fine-tuning configuration and hyperparameter reporting.
Clear before/after evaluation and error analysis.
Reproducible code, clean README, and meaningful commit history.
Quality of academic writing and presentation.

Practical Infrastructure

Students are encouraged to use modern open-source NLP infrastructure and to document their work professionally.

Experimentation

Python, PyTorch, Hugging Face Transformers/Datasets, Google Colab, and GPU-aware batch-size management.

Fine-Tuning

Unsloth, LoRA/QLoRA-style parameter-efficient fine-tuning, instruction datasets, and checkpoint management.

Evaluation

Turkish LLM benchmark tools, classification metrics, qualitative output analysis, and reproducible evaluation scripts.
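For classification-style tasks, the metrics mentioned above can be computed directly from gold and predicted labels. The sketch below assumes exact string match between labels and defines macro F1 as the unweighted mean of per-class F1; the label lists are toy data, not course results.

```python
def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold, pred, label):
    """F1 for one class, returning 0.0 when precision or recall is undefined."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


gold = ["olumlu", "olumlu", "olumsuz", "olumsuz"]
pred = ["olumlu", "olumsuz", "olumsuz", "olumsuz"]
print(accuracy(gold, pred))            # 0.75
print(round(macro_f1(gold, pred), 4))  # 0.7333
```

Running the same functions on base-model and fine-tuned predictions over one shared test set gives a like-for-like comparison table.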

Instruction Dataset Construction

A common project task is to transform a downstream NLP dataset into an instruction-style JSON format. The structure below illustrates the expected idea.

[
  {
    "instruction": "Aşağıdaki yorumun duygu durumunun olumlu veya olumsuz olduğunu söyle.",
    "input": "Çok heyecanlı bir filmdi, çok beğendim.",
    "output": "olumlu"
  },
  {
    "instruction": "Aşağıdaki metni özetle.",
    "input": "...uzun metin...",
    "output": "...kısa özet..."
  },
  {
    "instruction": "Büyük dil modelleri neden önemlidir?",
    "input": "",
    "output": "Büyük dil modelleri, insan dilini anlama ve üretme yetenekleri nedeniyle önemlidir."
  }
]

Students should keep the transformation rule explicit, validate output quality, remove duplicates where necessary, preserve clean train/test separation, and document dataset sources.
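One possible way to implement these checks is sketched below. It assumes the source dataset is a list of rows with hypothetical `text` and `label` fields, deduplicates on the exact input string, and flags training inputs that also appear in the test set; real datasets will need their own field names and possibly fuzzier matching.

```python
import json

# Explicit transformation rule: the same instruction is attached to every row.
INSTRUCTION = "Aşağıdaki yorumun duygu durumunun olumlu veya olumsuz olduğunu söyle."


def to_instruction_records(rows, instruction=INSTRUCTION):
    """Convert rows with hypothetical 'text' and 'label' fields into instruction records."""
    seen_inputs = set()
    records = []
    for row in rows:
        text = row["text"].strip()
        label = row["label"].strip()
        # Skip empty rows and exact duplicate inputs.
        if not text or not label or text in seen_inputs:
            continue
        seen_inputs.add(text)
        records.append({"instruction": instruction, "input": text, "output": label})
    return records


def find_leakage(train_records, test_records):
    """Return training records whose input also occurs in the test set."""
    test_inputs = {record["input"] for record in test_records}
    return [record for record in train_records if record["input"] in test_inputs]


train_rows = [
    {"text": "Çok heyecanlı bir filmdi, çok beğendim.", "label": "olumlu"},
    {"text": "Çok heyecanlı bir filmdi, çok beğendim.", "label": "olumlu"},  # duplicate
    {"text": "Senaryo çok zayıftı.", "label": "olumsuz"},
]
test_rows = [{"text": "Senaryo çok zayıftı.", "label": "olumsuz"}]

train_records = to_instruction_records(train_rows)
test_records = to_instruction_records(test_rows)
print(json.dumps(train_records, ensure_ascii=False, indent=2))
print(len(find_leakage(train_records, test_records)))  # 1 overlapping example to remove
```

Passing `ensure_ascii=False` keeps Turkish characters readable in the written JSON instead of escaping them.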

Useful Public Resources

Speech and Language Processing, 3rd ed. draft
Foundations of Statistical Natural Language Processing
word2vec Explained
Bag of Tricks for Efficient Text Classification
Transformer / LLM visualization
Unsloth documentation
Turkish LM evaluation harness
Marmara-NLP on Hugging Face
Turkish legal QA dataset
Turkish medical QA dataset

Course Policies Frequently Emphasized

Students should follow official deadlines carefully; confusing AM/PM deadline times or uploading at the last minute can create avoidable problems.
Group members are expected to contribute actively, and GitHub commit history may be used as evidence of individual contribution.
Attendance requirements and signature-based tracking may be enforced according to university and course policy.
Project artifacts should be reproducible, well documented, and suitable for later academic or portfolio use.
Model and dataset cards are encouraged when projects produce reusable models or datasets.

Who Should Take This Course?

The course is suitable for students who want to understand both the mathematical foundations of NLP and the current practice of building NLP/LLM systems. It is especially relevant for students interested in Turkish NLP, text classification, information extraction, generative AI, language model evaluation, legal AI, healthcare AI, and applied data science.
