This is the final project for INFSCI 2140 Information Storage and Retrieval. We built a search engine for movie lines: users enter a phrase or a complete sentence, and it returns a list of the best-matching movies with excerpts.

Teammates

Zijian Xu
Long Yan

Data Source

Film Corpus 2.0: https://nlds.soe.ucsc.edu/fc2

The corpus contains 960 film scripts, including dialogue and scene descriptions.

Tools

Backend: Spring MVC
Frontend: Angular 7
Cloud & Deployment: Microsoft Azure Cloud
Library: Apache Lucene

Implementation

Data Processing

Remove blank lines and special characters
Tokenize the text and normalize the terms with Porter stemming
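
The cleaning and tokenizing steps above can be sketched in plain Java. This is only an illustration, not the project's actual code: the class name `ScriptCleaner`, its methods, and the choice to keep only ASCII letters, digits, and apostrophes are assumptions. Porter stemming is deliberately left out here, since Lucene's analysis chain (for example its `PorterStemFilter`) would normally handle it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; names and the character whitelist are illustrative assumptions.
class ScriptCleaner {

    // Drop blank lines and replace every character that is not an ASCII
    // letter, digit, apostrophe, or space with a space.
    static String clean(String rawScript) {
        StringBuilder out = new StringBuilder();
        for (String line : rawScript.split("\\R")) {
            String stripped = line.replaceAll("[^A-Za-z0-9' ]", " ")
                                  .replaceAll("\\s+", " ")
                                  .trim();
            if (!stripped.isEmpty()) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(stripped);
            }
        }
        return out.toString();
    }

    // Lowercase the cleaned text and split it on whitespace (no stemming here).
    static List<String> tokenize(String cleaned) {
        List<String> tokens = new ArrayList<>();
        for (String t : cleaned.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy input, not taken from the corpus.
        String raw = "JERRY\n\n   Show me the money!!! \n\n";
        System.out.println(tokenize(clean(raw)));
    }
}
```

Running `main` on the toy input prints `[jerry, show, me, the, money]`.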

Data Pipeline

Indexing – Index Writer
- Fetch movie script documents from cloud
- Build the index with Lucene and keep it in RAM
- Run once when the server starts
Searching – Index Reader
- Tokenize query words
- Search for matching documents with Lucene
- Extract surrogate fragments with the Lucene highlighter
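
To make the index, search, and surrogate steps concrete, here is a toy in-memory inverted index in plain Java. It is a simplified stand-in for what Lucene does, not the project's code and not Lucene's API: the class `TinyMovieIndex`, its methods, the ranking by number of matched query terms, and the sample scripts are all assumptions made for illustration. Lucene's real scoring and highlighter are considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy stand-in for the Lucene pipeline; every name here is hypothetical.
class TinyMovieIndex {
    private final Map<String, String> scripts = new HashMap<>();        // title -> full text
    private final Map<String, Set<String>> postings = new HashMap<>();  // term -> titles containing it

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Indexing step: called once per script, everything stays in memory.
    void add(String title, String script) {
        scripts.put(title, script);
        for (String term : tokenize(script)) {
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(title);
        }
    }

    // Searching step: rank titles by how many distinct query terms they contain,
    // breaking ties alphabetically.
    List<String> search(String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (String term : new HashSet<>(tokenize(query))) {
            for (String title : postings.getOrDefault(term, Collections.<String>emptySet())) {
                scores.merge(title, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = scores.get(b) - scores.get(a);
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ranked;
    }

    // Surrogate step: return the script line that contains the most distinct query terms.
    String excerpt(String title, String query) {
        String script = scripts.get(title);
        if (script == null) {
            return "";
        }
        Set<String> terms = new HashSet<>(tokenize(query));
        String best = "";
        int bestHits = 0;
        for (String line : script.split("\\R")) {
            Set<String> lineTerms = new HashSet<>(tokenize(line));
            int hits = 0;
            for (String term : terms) {
                if (lineTerms.contains(term)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = line.trim();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scripts for illustration only.
        TinyMovieIndex index = new TinyMovieIndex();
        index.add("Movie A", "JERRY\nShow me the money!\nI will not rest.");
        index.add("Movie B", "The money is gone.\nShow yourself out.");
        for (String title : index.search("show me the money")) {
            System.out.println(title + ": " + index.excerpt(title, "show me the money"));
        }
    }
}
```

With the two made-up scripts, the query ranks "Movie A" first (all four query terms match) with the excerpt "Show me the money!", followed by "Movie B" with "The money is gone."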

Use Case

Users enter a phrase or sentence such as “show me the money”, and the system returns:

Search results

References

Film Corpus

Walker, Marilyn A., Ricky Grant, Jennifer Sawyer, Grace I. Lin, Noah Wardrip-Fruin, and Michael Buell. “Perceived or Not Perceived: Film Character Models for Expressive NLG.” BEST PAPER AWARD. In International Conference on Interactive Digital Storytelling (ICIDS), Vancouver, Canada, 2011.
Walker, Marilyn A., Grace I. Lin, and Jennifer E. Sawyer. “An Annotated Corpus of Film Dialogue for Learning and Characterizing Character Style.” In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.