Large Language Model Researcher

Independent Study & Computing Research Association (CRA) UR2PhD Online Program, Fall 2023

This research project is ongoing as an independent study with a student partner and two faculty advisors. It began in Fall 2023 in conjunction with the Computing Research Association's UR2PhD online program, which aims to engage undergraduate women interested in pursuing a doctoral degree in computer science. The UR2PhD course sharpened my ability to situate a research topic within the rapidly advancing field of computer science in order to develop insightful methods and findings. It also improved my understanding of foundational research skills and methods, which I practiced while writing and presenting a research proposal and while carrying out the research described below.

Our task was to investigate how accurately OpenAI's large language models (LLMs) can quote publicly available online texts, in the context of copyrighted training data. To answer this question, my team tested the ability of both GPT-3.5 and GPT-4 to complete randomly selected quotes from various corpora found online. Tools used include the OpenAI API, the spaCy library for tokenization, OpenAI's tokenizer (tiktoken), and Python libraries (e.g., pandas, NumPy, Matplotlib, Seaborn, Sentence Transformers).
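As a rough illustration of the quote-completion idea, the sketch below splits a quote into a prompt prefix and a held-out continuation, then scores a candidate completion against that continuation. It is not the project's actual pipeline. The function names `split_quote` and `completion_score` are hypothetical, whitespace splitting stands in for tiktoken or spaCy tokenization, and `difflib` from the standard library stands in for whatever similarity measure the project uses. The call to the model itself is left out.

```python
from difflib import SequenceMatcher


def split_quote(quote: str, prefix_fraction: float = 0.5) -> tuple[str, str]:
    """Split a quote into a prompt prefix and the held-out continuation.

    Hypothetical helper: whitespace words are an assumption standing in
    for a real tokenizer.
    """
    words = quote.split()
    if len(words) < 2:
        raise ValueError("quote needs at least two words to split")
    # Keep at least one word on each side of the cut.
    cut = min(max(1, int(len(words) * prefix_fraction)), len(words) - 1)
    return " ".join(words[:cut]), " ".join(words[cut:])


def completion_score(reference: str, completion: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means a verbatim match.

    Hypothetical metric: difflib's ratio is an assumption, not the
    project's scoring method.
    """
    ref_words = reference.lower().split()
    # A model may keep writing, so compare only as many words as the reference has.
    out_words = completion.lower().split()[: len(ref_words)]
    return SequenceMatcher(None, ref_words, out_words).ratio()


quote = "It was the best of times, it was the worst of times"
prefix, reference = split_quote(quote)
# `prefix` would be sent to the model; `reference` is what an exact quote would continue with.
```

In practice, the model's response to `prefix` would be passed as the `completion` argument, and scores could be collected per corpus and per model for comparison.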

Visit the GitHub Project