U4 Assignment

Deadline: April 17, 11:59pm CEST

Example Notebook

To copy this code and run it yourself, open the link above and choose “File > Save a copy in Drive”.

Errors you might encounter

  • File not found: Be sure to upload the example PDF. To run the LLMs, you must also get a Llama AI API key, save it to a .txt file named ‘key’, and upload that file as well.
  • Model does not exist: You have most likely exceeded your API key balance. Talk to the instructor.
  • JSON DECODER EXCEPTION: This appears to be an intermittent problem with the Llama API, in which inputs above a certain length return an empty JSON response. Either shorten your input or wait several minutes and try again.
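
The two workarounds for the empty-JSON problem, shortening the input and retrying after a wait, can be combined in one small helper. The following is only a minimal sketch, not the notebook's actual code: `call_model` is a hypothetical stand-in for whatever function sends your prompt to the API and returns the raw response text, and the character limit, retry count, and wait time are arbitrary assumptions that you should adjust.

```python
import json
import time


def extract_with_retry(call_model, text, max_chars=8000, retries=3, wait_seconds=60):
    """Call the model and parse its JSON reply, retrying on empty or invalid JSON.

    call_model is a hypothetical function: it takes a prompt string and returns
    the raw response body as a string. Replace it with your own API call.
    """
    # Shorten long inputs, since inputs above a certain length seem to trigger the problem.
    shortened = text[:max_chars]
    for attempt in range(1, retries + 1):
        raw = call_model(shortened)
        try:
            return json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            # An empty body raises JSONDecodeError; a None body raises TypeError.
            if attempt < retries:
                time.sleep(wait_seconds)
    # Every attempt failed; the caller decides how to handle the missing result.
    return None
```

Keep in mind that each retry is another API call and therefore counts against your budget, so a small number of retries is the safer choice.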

Exercise

If you have a specific project of your own, find some example documents and define the data you wish to extract. Use the provided Llama API key (you have only a small $5 budget, so be careful) to engineer some prompts of your own and try to extract the desired data. Try a few different prompts and analyze the results for errors. We will discuss your experience in class.
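
One way to try a few different prompts systematically is to run every variant on the same document and keep the replies side by side. The sketch below assumes a hypothetical `call_model` function that takes a prompt string and returns the model's reply; the template names and wording are illustrative only.

```python
def compare_prompts(call_model, templates, document_text):
    """Run each prompt template on the same document and collect the replies.

    templates maps a short label to a template string containing the
    placeholder {document}. Literal braces in a template (for example in a
    JSON example) must be doubled, as str.format requires.
    """
    results = {}
    for label, template in templates.items():
        prompt = template.format(document=document_text)
        results[label] = call_model(prompt)
    return results


# Hypothetical prompt variants; write your own for your extraction task.
example_templates = {
    "plain": "List the data items found in this text:\n{document}",
    "strict": "Answer with a list of items only. If none are present, answer NONE.\n{document}",
}
```

Because every variant costs one API call per document, it may be worth testing on one or two documents first before running the whole collection.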

If you do not have a project of your own, look through the provided PDFs. Their context is as follows. Across Bavaria, local municipalities publish scans of applications for building on plots of land on their websites. Some of these documents are the ‘building plans’ themselves, but others are supporting or unrelated documents that happened to be linked on the same web pages.

The task is to iterate through these PDFs and accurately record, by name, the regulations and policies cited in the building plans. That includes identifying when a document is not a building plan. Most important are the regulations on the maximum approved building height and on flood risk mitigation. Also important is the predicted frequency of major floods in the area.

The challenge is to see whether you can engineer prompts that extract these details effectively while minimizing false positives (the system claims that a certain text is present when it is not) and false negatives (regulations are mentioned, but the system does not detect them). Remember that this is not a graded exercise but a practice session for you to experience the kinds of issues that occur in real-world examples.
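
To analyze false positives and false negatives, it can help to label a handful of PDFs by hand and compare the system's output against those labels. The following is a minimal sketch under stated assumptions: the file names and regulation names are placeholders, not taken from the provided PDFs, and an empty set stands for "not a building plan, or nothing cited".

```python
def count_errors(predicted, expected):
    """Compare extracted regulation names with a hand-labeled reference.

    Both arguments map a document name to a set of regulation names.
    """
    true_positives = 0
    false_positives = 0  # reported by the system but not in the document
    false_negatives = 0  # in the document but missed by the system
    for doc, truth in expected.items():
        found = predicted.get(doc, set())
        true_positives += len(found & truth)
        false_positives += len(found - truth)
        false_negatives += len(truth - found)
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


# Placeholder labels and outputs, for illustration only.
expected = {"plan_a.pdf": {"regulation A", "regulation B"}, "invoice.pdf": set()}
predicted = {"plan_a.pdf": {"regulation A", "regulation C"}, "invoice.pdf": {"regulation A"}}
print(count_errors(predicted, expected))
# → {'true_positives': 1, 'false_positives': 2, 'false_negatives': 1}
```

This comparison uses exact string matches, so a regulation that the model names with slightly different wording counts as both a false positive and a false negative. You may want to normalize names (for example, lowercase them and strip whitespace) before comparing.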