Generative AI

A course from the Orbis Cascade Alliance on using GenAI for library work

Assessment & Policies: Evaluating Generative AI Tools for Use in Higher Ed

Authors: Electra Enslow (University of Washington) and Norman Lee (University of Idaho)

This lesson aims to help librarians assess AI tools their institutions are thinking about adopting, as well as tools they have already adopted and wish to monitor. Since much of assessment is project-dependent, this module strikes a balance between discussing other institutions' experiences with AI, which may or may not apply to a reader's institution, and providing an overview of the concepts and tools institutions frequently use to assess AI. Readers are then free to pick and choose which are most appropriate for them. To make the material more concrete, the lesson concludes with some suggestions about how to pick and choose, as well as a mini-assessment example.

Recording and Materials

The recording and slides for this lesson will be posted here after April 24, 2026.

In This Lesson

More Reasons to Care About Assessment

The previous modules demonstrate that AI tools are not a panacea and should not be adopted uncritically. Each institution is unique, so any AI tool should be assessed on its merits in that specific context. However, there is some preliminary research on how successful AI adoption tends to be, which can provide useful context when deciding how much effort to put into assessment.

Systematic studies of AI adoption in academic libraries are rare. However, broader analyses by McKinsey & Company (2025), MIT's NANDA lab (Challapally, Pease, Raskar, & Chari, 2025), and Harvard Business School (Dell'Acqua et al., 2026) cover knowledge workers generally and can apply to libraries. For example, despite much discussion, they find that only ~5–7% of institutions have formally and systematically integrated AI into their workflows. Roughly 80% of individuals experiment with or regularly use AI, but enterprise-level adoption is rarer and often fails at the pilot stage: ~60% of institutions have at least investigated or experimented with enterprise tools, yet only ~20–30% move on to pilot them. This can be due to:

On the other hand, organizations that successfully adopt AI tools tend to share several strategies: keeping a human in the loop, building technology infrastructure, clearly defining AI road maps, securing leadership buy-in, embedding AI solutions into existing processes, agile product delivery, strategic workforce planning, iterative solution development, and rapid development cycles.

Of those that do adopt AI, 39% believe AI quantifiably decreased costs, usually by 10% or less. However, adopters also report qualitative improvements in areas like innovation, customer/employee satisfaction, and cost, noting that because AI tools are integrated into larger workflows, their specific effects can be difficult to tease out.

Common Assessment Concepts

For those who wish to adopt generative AI tools, there is no one-size-fits-all approach to assessment. Below is a list of common assessment concepts. It can be overwhelming, but remember that not every priority will be relevant for every product and, in many cases, a vendor or external partner may have already taken care of one or more points. Below the list are some general guidelines on how to determine which concepts are appropriate for an institution's use case and how they should be weighted relative to one another.

AI Performance Concepts

These concepts cover how the AI performs as a standalone tool, without taking into account whether it fits any institution's specific policies & procedures.

To make each concept more concrete, the list includes examples of how each might be applied to three different kinds of generative AI programs: a chatbot (like ChatGPT), a coding agent/assistant (like Claude Code), and an AI-assisted information retrieval tool (i.e., retrieval-augmented generation systems like Consensus or Elicit).

AI Integration Concepts

These concepts cover how easy/hard it is to integrate the tool into existing policies & procedures, monitor it, and make any necessary changes, regardless of how well the AI tool performs.

Picking Which Assessment Concepts To Use

Hatton (2008) summarizes four common methods of prioritizing software requirements.

First is a simple ranking method, where requirements are ordered 1…n, with 1 being the most important and n the least. This may only be appropriate for cases with roughly seven or fewer priorities, because humans are limited in how much information they can process simultaneously.

The MoSCoW method sorts priorities into four groups: Must have, Should have, Could have, and Won't have (this time).

The $100 method asks decision makers to split $100 between all their priorities, with higher priorities getting more money. If there are no clear, standout priorities, a second $100 can be split, but only as one pile of $50 and two piles of $25, which forces decision makers to select "standout" priorities when things are equivocal.

The Analytic Hierarchy Process (AHP) compares each requirement to every other requirement (accuracy and relevance, accuracy and safety, etc.), picks the more important priority within each pair (e.g., safety is more important than accuracy), and assigns a number between 1 and 9 to how much more important it is (e.g., 1: safety is barely more important than accuracy; 9: safety is extremely important compared to accuracy). Weighted scores for each priority can then be calculated.13 If decision makers are attempting to choose between two or more AI tools, rather than deciding how to build/adopt a single one, each tool must also be compared to each other tool on each of the priorities.14 Hatton suggests AHP only for larger, complex projects with potentially conflicting priorities.
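To make the arithmetic concrete, below is a minimal Python sketch of deriving AHP weights from a pairwise comparison matrix, using the common geometric-mean approximation of the principal eigenvector. The three criteria and the 1-9 judgments are made-up examples, not recommendations:

    import numpy as np

    criteria = ["safety", "accuracy", "relevance"]

    # pairwise[i][j] = how much more important criteria[i] is than criteria[j],
    # on the 1-9 scale described above; entries below the diagonal are reciprocals.
    pairwise = np.array([
        [1.0, 3.0, 5.0],
        [1/3, 1.0, 2.0],
        [1/5, 1/2, 1.0],
    ])

    # The geometric mean of each row, normalized to sum to 1, gives the weights.
    geo_means = pairwise.prod(axis=1) ** (1 / pairwise.shape[1])
    weights = geo_means / geo_means.sum()

    for name, w in zip(criteria, weights):
        print(f"{name}: {w:.2f}")  # safety ~0.65, accuracy ~0.23, relevance ~0.12

Dedicated AHP software and packages like AHPy (see footnote 13) also compute a consistency ratio, which flags contradictory judgments (e.g., ranking A over B and B over C, but C over A).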

Rubrics

Using a rubric, especially if you are discovering new tools and workflows, gives you clear criteria, such as accuracy, privacy, usability, cost, integration, or compliance, and lets you compare tools side by side using the same standards. Additionally, a good rule of thumb in assessment is to follow the money. How is the vendor using your information? Is advertising being incorporated into the output? This is more prevalent in OpenAI systems, but always read the user agreements.
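To show how a rubric turns into a side-by-side comparison, the short Python sketch below scores two hypothetical tools against weighted criteria. The weights, criteria, and scores are placeholders that illustrate the mechanics, not values taken from the ASCCC rubric:

    # Hypothetical criteria weights (summing to 1) and 1-5 scores per tool.
    weights = {"accuracy": 0.3, "privacy": 0.3, "usability": 0.2, "cost": 0.2}

    scores = {
        "Tool A": {"accuracy": 4, "privacy": 5, "usability": 3, "cost": 4},
        "Tool B": {"accuracy": 5, "privacy": 2, "usability": 5, "cost": 3},
    }

    # A weighted sum per tool makes trade-offs explicit: Tool B is more accurate
    # and more usable, but Tool A wins once privacy is weighted heavily.
    for tool, s in scores.items():
        total = sum(weights[c] * s[c] for c in weights)
        print(f"{tool}: {total:.2f}")  # Tool A: 4.10, Tool B: 3.70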

The following rubric from the Academic Senate for California Community Colleges (ASCCC) is highly applicable to academic libraries for evaluating AI tools: Evaluating Artificial Intelligence (AI) Tools in an Academic Setting.

Comparing Ollama and ChatGPT for Library Use

Let’s assess Ollama and ChatGPT for developing chatbots using transparency, usability and scalability from the ASCCC rubric.

Ollama lets you run AI models directly on your own computer, so your data never leaves your device and you can avoid ongoing subscription or API costs. ChatGPT runs in the cloud, giving you access to more powerful models but requiring your data to be sent to external servers, with costs based on usage.

When deciding whether to use Ollama tools in a library or research environment, one of the biggest questions is how well it protects sensitive information. Ollama runs AI models locally on your own computer or server, not on someone else’s cloud system. Because of this, anything you type into it stays inside your organization’s own network. That matters if your team sometimes handles information covered by HIPAA (which protects health information) or FERPA (which protects student education records). Using a tool like Ollama can make it easier to follow these rules, because the data never leaves your systems unless you choose to send it somewhere.
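For illustration, here is a minimal Python sketch of querying a locally running Ollama server. It assumes you have installed Ollama and pulled a model (for example, by running "ollama pull llama3"); because the request goes to localhost, the prompt and response stay on your own machine:

    import requests

    # Ask a locally pulled model a question. The request is handled entirely
    # by the Ollama server on this machine; nothing is sent to an outside service.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",  # any model you have pulled locally
            "prompt": "Summarize what an interlibrary loan is in two sentences.",
            "stream": False,    # return one JSON object instead of a token stream
        },
        timeout=120,
    )
    print(resp.json()["response"])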

ChatGPT, on the other hand, runs on OpenAI’s servers, not your own. That makes it very convenient, easy to use, and generally more powerful out of the box. But because it’s cloud‑based, organizations usually set limits on what staff can put into it. For example, most institutions do not allow entering anything that could identify a patient (HIPAA) or a student (FERPA).

Transparency

Ollama

Ollama tools are generally more transparent because they use open-source models (like Llama, Mistral, Phi, etc.) and software that libraries can inspect, document, and control. Staff can see which model is running, how it was configured, and where all data and logs live. This aligns well with library values around openness, explainability, and user privacy, especially when dealing with policies, metadata creation, or research support. Because it runs locally, libraries also have full visibility into what information is or isn't leaving their systems.

ChatGPT

ChatGPT is more opaque because it is a proprietary, cloud-based system. Libraries cannot see how the underlying model is trained, how it reasons, or how it transforms data internally. Policies and documentation exist, but the inner workings are not open for review or audit. For everyday tasks this may not matter, but for formal library work, especially anything touching research integrity, metadata generation, or information covered by HIPAA, FERPA, or other sensitive categories, this lack of transparency can make governance more complicated.

Comparison

Ollama: High transparency, aligns with open‑knowledge values.

ChatGPT: Low transparency, relies on vendor assurances.

Usability

Ollama

Usability depends on the setup. On its own, Ollama is a developer‑oriented tool. It works best for staff who are comfortable with basic technical steps (installing software, running commands, or using a simple UI built on top). Libraries may need to provide training or a custom front end before it becomes easy for general staff to use. Once configured, however, it works smoothly and can be integrated into local workflows and institutional systems.

ChatGPT

ChatGPT-based development tools are extremely easy to use: open the website or app, type a development prompt, and get a tool. No installation, no configuration, no hardware requirements. For busy library staff in public services, instruction, and outreach, this ease of use is a major advantage. It's also designed for non-technical users, with a polished interface, multimodal features, and built-in support. However, outputs require continuous checking for misinformation because of the lack of transparency.

Comparison

Ollama: Usable with setup; best for staff who can handle light tech steps or when a library builds a simple interface.

ChatGPT: Highly user‑friendly immediately; minimal training needed.

Scalability

Ollama

Scalability depends on library-owned hardware. If multiple staff use Ollama tools, the library must decide whether to install it on personal devices, a shared server, or a campus‑hosted VM with GPUs. Local models can be lightweight or heavy, but large models require more computing power. Scaling across departments may require coordination with IT or research computing. For small groups or internal workflows, it scales well; for institution‑wide offerings, planning is needed.

ChatGPT

ChatGPT scales effortlessly because OpenAI handles all the computing. Whether one librarian or a thousand build with it at once, performance remains the same. For large organizations with many users or for public service areas where demand may spike unpredictably, ChatGPT scales far more easily. But this does come at a financial cost. Your institution may have continuing costs with ChatGPT due to things like increased token usage or purchasing upgraded plans.
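As a back-of-the-envelope illustration of those continuing costs, the Python sketch below estimates monthly API spend. The per-token prices are hypothetical placeholders, so substitute your vendor's current price sheet:

    # Hypothetical prices in USD per million tokens; check your vendor's pricing.
    PRICE_PER_1M_INPUT = 2.50
    PRICE_PER_1M_OUTPUT = 10.00

    def monthly_cost(users, queries_per_user, in_tokens, out_tokens):
        """Estimate monthly API spend for a pool of users."""
        total_in = users * queries_per_user * in_tokens
        total_out = users * queries_per_user * out_tokens
        return (total_in / 1e6) * PRICE_PER_1M_INPUT + (total_out / 1e6) * PRICE_PER_1M_OUTPUT

    # E.g., 50 staff, 100 queries each per month, ~500 tokens in / ~700 out per query.
    print(f"${monthly_cost(50, 100, 500, 700):.2f} per month")  # $41.25 under these assumptions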

Comparison

Ollama: Scales well for small teams; campus‑wide scaling requires IT support and good hardware.

ChatGPT: Automatically scales to any number of users with no extra effort.

Overall Takeaways for Libraries

Ollama is strongest when libraries prioritize: transparency and open-source software; privacy and compliance with rules like HIPAA and FERPA; control over data, logs, and infrastructure; and avoiding ongoing subscription or API costs.

ChatGPT is strongest when libraries prioritize: ease of use with minimal setup or training; access to more powerful models out of the box; and effortless scaling across many users or unpredictable demand.

References

AID Statements

Footnotes

  1. Hallucination rates depend on model and task. For instance, compare Vectara’s (2026) proprietary leaderboard with OpenAI, Google, and Anthropic’s self-reported statistics (Anand, 2025). For this reason, it’s important to benchmark based on data from your individual use-case. 

  2. See Precision and recall (Wikipedia, 2026) for calculations; a short worked example appears after these footnotes. 

  3. Relevance can be assessed mathematically by using measures of 'similarity' such as cosine similarity (Krantz & Jonker, n.d.). However, these measures only say that two vectors are similar according to one particular mathematical function, and they do not tell us why. Accordingly, it can be helpful to replace or supplement mathematical techniques with more qualitative ones, such as grounded theory (Hecker & Kalpokas, n.d.). A worked cosine-similarity example appears after these footnotes. 

  4. See Precision and recall (Wikipedia, 2026) for calculations 

  5. Assessing robustness/reliability can be time consuming if an external partner (like a vendor) hasn't done it already. EvidentlyAI (2025) provides some ways to automate this using LLM-as-judge techniques; a minimal sketch appears after these footnotes. 

  6. Since LLMs are mostly trained on data created by humans, they can inherit human biases (Hall, 2025). 

  7. The more popular a programming language is, the more training data an LLM will have for it. Common benchmarks also have biases; e.g., the very popular SWE-bench (Jimenez et al., 2024) is mostly Python. 

  8. LLMs can do harm in a variety of ways. For instance, a user attempting to prompt an LLM into outputting something harmful is engaging in 'adversarial prompting'. AWS has compiled a list of common adversarial prompt strategies (AWS, n.d.), and see Kaiyom et al. (2024) for a recent safety leaderboard. Coding agents can also create vulnerabilities in programs they design (Kaminsky, 2025), and any web-based LLM applications need to include standard malware protections. For an overview of how to approach AI risk assessment, see Koessler & Schuett (2023). 

  9. Intentional, ongoing AI governance is critical to maintain alignment (Ferrari et al., 2025; UNESCO, 2023) 

  10. There is no consensus on the overall impact, cost, or profit of AI; again, it seems to vary on a case-by-case basis. See Challapally et al. (2025), Dell'Acqua et al. (2026), McKinsey (2025), and Patwardhan et al. (2025). 

  11. This can apply to the transparency of the companies that create AI models (see Wan et al., 2025, for a recent index), but for practical purposes it more frequently refers to a model's implementation in any specific application. VS Code is compatible with many AI tracing tools (Visual Studio Code, n.d.), and there are also specially designed tools like Arize Phoenix (n.d.). 

  12. Again, see Ferrari et al. (2025) and UNESCO (2023). 

  13. E.g., using specialized AHP software (Creative Decisions Foundation, n.d.) or Python packages like PhillipGriffith's AHPy. 

  14. For further details and a worked example see Analytic hierarchy process - car example (2025) on Wikipedia.
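As a supplement to footnotes 2 and 4, here is a small worked example of precision and recall in Python, using made-up retrieval counts:

    # Of 20 documents an AI retrieval tool returned, 15 were relevant (true
    # positives) and 5 were not (false positives); 5 relevant documents were
    # missed entirely (false negatives). All counts are made up for illustration.
    true_positives = 15
    false_positives = 5
    false_negatives = 5

    precision = true_positives / (true_positives + false_positives)  # 0.75
    recall = true_positives / (true_positives + false_negatives)     # 0.75
    print(f"precision={precision:.2f}, recall={recall:.2f}")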
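As a supplement to footnote 3, here is a minimal cosine-similarity calculation. The three-dimensional vectors are toy placeholders; real embedding vectors have hundreds or thousands of dimensions:

    import numpy as np

    # Two toy 'embedding' vectors; the values are arbitrary placeholders.
    a = np.array([0.2, 0.8, 0.1])
    b = np.array([0.1, 0.9, 0.0])

    # Cosine similarity: the cosine of the angle between the two vectors,
    # ranging from -1 (opposite) to 1 (identical direction).
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"cosine similarity = {similarity:.3f}")  # ~0.984 for these vectors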
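Finally, as a supplement to footnote 5, here is a minimal LLM-as-judge sketch in which one model grades another model's answer against a reference answer. It reuses the local Ollama server from the earlier example; the grading prompt and 1-5 scale are illustrative only:

    import requests

    def judge(question, answer, reference):
        """Ask a local model to grade an answer against a reference, 1-5."""
        prompt = (
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "On a scale of 1-5, how faithful is the candidate answer to the "
            "reference? Reply with a single digit."
        )
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3", "prompt": prompt, "stream": False},
            timeout=120,
        )
        return resp.json()["response"].strip()

    print(judge("What are the library's summer hours?",
                "9am to 5pm on weekdays.",
                "Monday-Friday, 9:00am to 5:00pm; closed weekends."))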