IR Project -- Movie Script Search

Catalogue
  1. Introduction
     1.1. Teammates
  2. Data Source
  3. Tools
  4. Implementation
     4.1. Data Processing
     4.2. Data Pipeline
  5. Use Case
  6. References
     6.1. Film Corpus

    Introduction

    This is the final project for INFSCI 2140, Information Storage and Retrieval. We built a search engine over movie scripts. Users enter a phrase or a complete sentence, and the engine returns a list of the best-matching movies, each with excerpts.

    Teammates

    • Zijian Xu
    • Long Yan

    Data Source

    Film Corpus 2.0: https://nlds.soe.ucsc.edu/fc2

    The corpus contains 960 film scripts, including dialogue and scene descriptions.

    Tools

    • Backend: Spring MVC
    • Frontend: Angular 7
    • Cloud & Deployment: Microsoft Azure Cloud
    • Library: Apache Lucene

    Implementation

    Data Processing

    1. Remove blank lines and special characters
    2. Tokenize the text and normalize the terms with Porter stemming
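
    The two steps above can be sketched in plain Java. This is only an illustration under stated assumptions, not the project's code: the class name `ScriptPreprocessor` and its methods are hypothetical, "special characters" is assumed to mean anything other than letters, digits, and apostrophes, and stemming is left out. In a Lucene-based setup, stemming would typically be delegated to an analyzer chain that includes `PorterStemFilter`, such as `EnglishAnalyzer`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not taken from the project source.
final class ScriptPreprocessor {

    // Step 1: drop blank lines and replace special characters with spaces.
    // Assumption: "special" means anything except letters, digits, and apostrophes.
    static List<String> cleanLines(String rawScript) {
        List<String> cleaned = new ArrayList<>();
        for (String line : rawScript.split("\\R")) {
            String stripped = line.replaceAll("[^\\p{L}\\p{N}']+", " ").trim();
            if (!stripped.isEmpty()) {
                cleaned.add(stripped);
            }
        }
        return cleaned;
    }

    // Step 2 (first half): lowercase and split on whitespace.
    // Porter stemming is intentionally omitted from this sketch.
    static List<String> tokenize(String cleanedLine) {
        List<String> tokens = new ArrayList<>();
        for (String token : cleanedLine.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up sample text, not an excerpt from Film Corpus 2.0.
        String raw = "JERRY\n\n   \nShow me the money!!!\n";
        for (String line : cleanLines(raw)) {
            System.out.println(tokenize(line));
        }
    }
}
```

    Running `main` prints one token list per surviving line. The blank and whitespace-only lines are dropped and the punctuation is removed before tokenizing.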

    Data Pipeline

    • Indexing – Index Writer

      • Fetch the movie script documents from the cloud
      • Build the Lucene index in RAM
      • Runs once, when the server starts
    • Searching – Index Reader

      • Tokenize the query
      • Search the index for matching documents with Lucene
      • Extract surrogate excerpts with the Lucene highlighter
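
    To make the two stages above concrete, here is a minimal plain-Java stand-in for the flow of indexing once and searching many times. It deliberately does not use Lucene. The class `TinyScriptIndex`, its methods, and the sample scripts are all hypothetical. Scoring is a simple count of shared terms rather than Lucene's relevance scoring, and the excerpt is the single best-matching line rather than a highlighter fragment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java stand-in for the index/search flow; it does not use Lucene.
final class TinyScriptIndex {
    // term -> ids of the documents that contain it (held in memory only)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> titles = new ArrayList<>();
    private final List<String> texts = new ArrayList<>();

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}']+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Indexing stage: called once per script, e.g. at server startup.
    void add(String title, String text) {
        int docId = titles.size();
        titles.add(title);
        texts.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    // Searching stage: tokenize the query, score documents by the number of
    // distinct query terms they contain, and attach an excerpt to each hit.
    List<String> search(String query, int limit) {
        Set<String> queryTerms = new HashSet<>(tokenize(query));
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : queryTerms) {
            for (int docId : postings.getOrDefault(term, Collections.emptySet())) {
                scores.merge(docId, 1, Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a)); // higher score first
            return byScore != 0 ? byScore : Integer.compare(a, b);       // ties: insertion order
        });
        List<String> results = new ArrayList<>();
        for (int docId : ranked.subList(0, Math.min(limit, ranked.size()))) {
            results.add(titles.get(docId) + ": " + excerpt(texts.get(docId), queryTerms));
        }
        return results;
    }

    // Surrogate: the script line sharing the most terms with the query
    // (the first such line wins on ties).
    private static String excerpt(String text, Set<String> queryTerms) {
        String best = "";
        int bestHits = 0;
        for (String line : text.split("\\R")) {
            Set<String> lineTerms = new HashSet<>(tokenize(line));
            lineTerms.retainAll(queryTerms);
            if (lineTerms.size() > bestHits) {
                bestHits = lineTerms.size();
                best = line.trim();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyScriptIndex index = new TinyScriptIndex();
        // Made-up sample scripts, not taken from Film Corpus 2.0.
        index.add("Script A", "INT. OFFICE - DAY\nShow me the money!\nI will.");
        index.add("Script B", "EXT. STREET - NIGHT\nKeep the change.");
        index.search("show me the money", 10).forEach(System.out::println);
    }
}
```

    With the two sample scripts, "Script A" ranks first because it contains all four query terms, and "Script B" follows because it shares only "the". In a Lucene-based pipeline, the analyzer, index writer, searcher, and highlighter would take over the roles of `tokenize`, `add`, `search`, and `excerpt`.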

    Use Case

    Users enter a phrase or sentence such as “show me the money”, and the system returns the best-matching movies with excerpts:

    [Figure: search results]

    References

    Film Corpus 2.0, https://nlds.soe.ucsc.edu/fc2