Lectures |
Topics |
Readings |
Introduction
|
|
Introduction and Boolean Retrieval
[slides] |
- Chapter 1, Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: A. Moffat, J. Zobel, D. Hawking, Recommended reading for IR research students, ACM SIGIR Forum, vol. 39, no. 2, pp. 3-14, 2005.
- See Sergey Brin, speaking on Search, Google and Life, UC Berkeley, Oct. 2005.
|
Information Retrieval Fundamentals
|
|
Text encoding: tokenization, stemming, lemmatization, stop words, phrases
[slides] |
- Chapter 2, Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: Bahle, D., Williams, H. E., and Zobel, J. 2002. Efficient phrase querying with an auxiliary index. In Proceedings of the 25th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (Tampere, Finland, August 11 - 15, 2002).
|
|
Apache Lucene
[slides] |
|
31/1 |
Dictionaries & Tolerant retrieval
[slides] |
- Chapter 3, Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - practice and experience 25(3), March 1995.
- Optional reading: K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), Dec 1992.
|
|
Apache Lucene
[slides]
[lab files] |
|
7/2 |
Index construction
[slides] |
- Chapter 4, Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: Shanks, V. R. and Williams, H. E. 2003. Index construction for linear categorisation. In Proceedings of the Twelfth international Conference on information and Knowledge Management (New Orleans, LA, USA, November 03 - 08, 2003).
- Optional reading: Dean, J. and Ghemawat, S. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (San Francisco, CA, December 06 - 08, 2004).
- See the video of Jeff Dean's (Google Inc) colloquium Google: A Behind-the-Scenes Look at the University of Washington, October 2004; covers aspects of MapReduce and the systems behind the search engine.
|
|
Apache Solr
[slides]
[apache solr 3.5]
|
|
14/2
|
Index compression
[slides]
Vector Space Retrieval
[slides] |
- Chapters 5, 6, Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: Büttcher, S. and Clarke, C. L. 2007. Index compression is good, especially for random access. In Proceedings of the Sixteenth ACM Conference on Conference on information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007).
|
|
Hadoop
[slides]
|
|
21/2 |
Vector Space Retrieval & Computing Scores in a Complete Search System
[slides1]
[slides2]
|
- Chapters 6, 7 Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: Zobel, J. and Moffat, A. 1998. Exploring the similarity space. SIGIR Forum 32, 1 (Apr. 1998).
|
|
Hadoop
[hadoop eclipse plugin]
[slides]
[solutions]
[Info for setting up Hadoop on Eclipse]
|
|
28/2 |
Evaluation in IR, Relevance Feedback & Query Expansion
[slides] |
- Chapters 8, 9. Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: Anh, V. N., de Kretser, O., and Moffat, A. 2001. Vector-space ranking with effective early termination. In Proceedings of the 24th Annual international ACM SIGIR Conference on Research and Development in information Retrieval (New Orleans, Louisiana, United States).
|
|
Nutch
[slides] |
|
Semi-structured Data and Structured Retrieval |
| 6/3 |
XML retrieval and Querying the Data Web
[slides] |
- Chapter 10, Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze.
- Overview of XML, XPATH, Semistructured data.
- M. Jarrar and M. D. Dikaiakos. "Querying the Data Web - The MashQL approach." IEEE Internet Computing, 2010.
- Schlieder T., Meuss, H. "Querying and Ranking XML Documents." Journal of the American Society for Information Science and Technology, 53(6): 489-503, 2002.
- Optional reading: S. Amer-Yahia, M. Lalmas, "XML Search: Languages, INEX, and Scoring." SIGMOD Record, Vol. 35, No. 4, December 2006.
|
|
Tika
[slides] |
|
|
Software retrieval on Clouds & Revision
[slides] |
|
| 16/3 |
Midterm [Midterm Sample]
[Midterm Grades]
|
- Topics: Chapters 1-8, Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze.
- The midterm exam will last 90 minutes.
|
Classification and Clustering |
| 20/3 |
Data classification
[slides] |
- Chapters 13, 14. Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: a review. ACM Comput. Surv. 31, 3 (Sep. 1999), 264-323
|
|
Mahout
[slides] |
|
| 27/3 |
Data clustering
[slides] |
- Chapters 16, 17. Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: a review. ACM Comput. Surv. 31, 3 (Sep. 1999), 264-323
- See the video of Ulrike von Luxburg's (Max Planck Institute for Biological Cybernetics) colloquium Lectures on Clustering at the PASCAL Bootcamp in Machine Learning.
- See the video of Yee Whye Teh's (University College London) colloquium Hierarchical Clustering at the EPSRC Winter School in Mathematics for Data Modelling.
|
Web Search
|
|
Mahout Clustering
and Classification
[slides] |
|
3/4 |
Web search Basics, Crawling and Indexing
[slides] |
- Chapters 19, 20. Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Optional reading: Search Engine Users: Internet searchers are confident, satisfied and trusting -- but they are also unaware and naive, by Deborah Fallows, Pew Internet Research report, January 23, 2005.
- Optional reading: Kobayashi, M. and Takeda, K. 2000. Information retrieval on the web. ACM Comput. Surv. 32, 2 (Jun. 2000), 144-173.
- Crawling Techniques (Chapter 6, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003.)
- Optional reading: An Investigation of Web Crawler behavior: Characterization and Metrics. M. D. Dikaiakos, A. Stassopoulou, L. Papageorgiou. Computer Communications, May 2005. Vol. 28, Issue 8, pp. 880-897, Elsevier (available online through Elsevier's portal; locally in pdf).
- Optional reading: Baeza-Yates, R. and Castillo, C. 2007. Crawling the infinite Web. Journal of Web Engineering 6, 1 (Feb. 2007), 49-72.
|
|
Search Engine Optimization and New Trends
|
|
24/4 |
Link analysis
[slides] |
- Chapter 21, Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze
- Chapter 5, Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman, Cambridge University Press, 2011
- Optional reading: Sidiropoulos A., Pallis G., Katsaros D., Stamos K., Vakali A., Manolopoulos Y.: Prefetching in Content Distribution Networks via Web Communities Identification and Outsourcing, World Wide Web Journal, 2008
- See the interview with Jon Kleinberg (inventor of the HITS algorithm), recorded at the KDD 2007 Conference.
- Link Analysis (Chapter 5, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003.)
- Chapter 13, Google's PageRank and Beyond: The Science of Search Engine Rankings, by Amy N. Langville & Carl D. Meyer.
|
|
Projects Presentation
|
- Presentation of the students' final projects
|
|
| |
Final Exams
[Final Grades]
|
|