3 Comments
Author · Nov 9, 2023 (edited)

That is a really good question and one that has yet to be sorted out.

I think it will depend on a whole variety of factors. Some archival documents will be subject to copyright, while many will not. Some archives already prohibit sharing images without their permission, and that is usually written into their terms of service or user agreements (the things you agree to on a website, or the documents you sign when you take pictures at an archive). It will also vary depending on the software you use and where you use it.

If you run a local LLM like Llama 2 on your own computer, which I do in some cases, the ethical questions are no different than whether you use Adobe or Microsoft Photos to view the images, or Excel or SPSS to compile data. You're just using a software program to conduct private research on your own computer; there is no transmission or sharing of the records.
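For readers wondering what "running a local LLM" actually looks like, here is a minimal sketch. It assumes you are serving a Llama 2 model locally with Ollama (my assumption about tooling, not necessarily what the author uses) and that the document transcript is a plain text file; the file name and prompt are placeholders.

```python
# Minimal sketch: querying a Llama 2 model served locally by Ollama.
# Assumes Ollama is running on its default port and `ollama pull llama2`
# has already been done. Nothing here leaves your own machine.
import json
import urllib.request


def ask_local_model(prompt: str, model: str = "llama2") -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # "archival_transcript.txt" is a placeholder for whatever record you are working with.
    transcript = open("archival_transcript.txt", encoding="utf-8").read()
    print(ask_local_model(f"Summarize this archival document:\n\n{transcript}"))
```

The point of the sketch is simply that the request goes to localhost: the document never leaves your computer, which is what makes the ethics comparable to opening the file in any other desktop program.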

I'm not a legal expert, but I suspect the same is true if you use OpenAI's API to process images or data, because it is a secure, private gateway and, under their terms of service, they do not retain the information sent to the API or use it for training purposes. So if you are using the API for private research, I suspect the legal and ethical issues are no different than taking a picture of a document with an iPhone (which stores those images in the cloud), putting a document into Word (which is now cloud-based too), storing those files on your OneDrive, sending them to Transkribus for transcription, using Google Translate, or putting them into Dropbox. In those cases, as long as the LLM does not store or use the data and is secure, it's just another piece of software. That said, in my experiments with the API (as distinct from locally run models) I have been careful to use open-access, publicly available, and copyright-free archival records. That is also why I used my own book manuscript for the tests above: I kindly gave myself permission to send the entire text to the API.
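For what it's worth, sending an image of a document to the API looks roughly like this. The model name, file path, and prompt are placeholders (and you should verify the current data-retention terms yourself before sending anything restricted):

```python
# Minimal sketch: sending an image of an archival document to the OpenAI API
# for transcription. Assumes the `openai` Python package (v1+) is installed
# and OPENAI_API_KEY is set in the environment.
import base64

from openai import OpenAI

client = OpenAI()

# "public_domain_letter.jpg" is a placeholder; use records you have the right to process.
with open("public_domain_letter.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # a vision-capable model; swap in whatever is current
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the handwriting in this document."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=1000,
)
print(response.choices[0].message.content)
```

Mechanically it is just an HTTPS request to a service, which is why I compare it to Transkribus, Google Translate, or a cloud drive: the meaningful differences are in the provider's retention and training terms, not in the plumbing.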

Inputting images of archival documents or their texts into ChatGPT is a bit dicier, as it is not a "secure" interface: unless you have the paid version and change your settings to tell OpenAI not to use your conversations for training purposes, they will. Again, when I do this I am very careful to consider whether I actually have the right to input the image into ChatGPT.

But here is one really important thing I know you cannot do: if there are specific use restrictions on your records or prohibitions against electronic storage and retransmission, especially if your records contain private or classified data or information subject to the Canadian Privacy Act, you most definitely cannot send them to an external LLM via an API or a browser-based program like ChatGPT. It's important to keep in mind, though, that such restrictions are not limited to LLMs: people would also be prohibited from storing those types of records on any cloud-based server like iCloud or OneDrive, or from using cloud-based software (Adobe, Word, etc.) to do things with the documents.

The case of copyrighted materials, including PDFs of books and articles, may sound clearer, but it will again depend at least in part on the terms of service under which you acquired them in the first place. Most people probably don't think about this when they run paragraphs through Google Translate, email a paper to a friend, or save a copy of an article to their OneDrive, but we are more conscious of it with a new technology like LLMs. Here the law remains very much unsettled, which is why authors and creators are suing companies like OpenAI.

An interesting development is that OpenAI also announced on Tuesday a new Copyright Shield program, which says: "OpenAI is committed to protecting our customers with built-in copyright safeguards in our systems. Today, we’re going one step further and introducing Copyright Shield—we will now step in and defend our customers, and pay the costs incurred, if you face legal claims around copyright infringement. This applies to generally available features of ChatGPT Enterprise and our developer platform." (https://openai.com/blog/new-models-and-developer-products-announced-at-devday) This is similar to a program that Google unveiled a while back. It's a pretty sweeping indemnification, which suggests to me that OpenAI is fairly confident that using LLMs to process even copyrighted materials will be upheld in the courts. We'll see. Whether it's ethical to do so is a very different matter.

So a really good question…maybe I'll turn this into a blog post. I think best practice is to ensure you always adhere to any relevant terms of service and user agreements that apply to your documents. Don't send legally restricted or classified information to any cloud service. And be mindful of what you are doing with an LLM and where you are using it (locally, via the API, or in a web browser).


Heard your podcast and was enthused!! My project is tiny: getting maybe 500 pages of Homeowners Association documents into a trained chat. I have been using two different products, Dante and My AskAI. They both make it pretty easy, but I am concerned about accuracy, and your experience shows me "behind the curtain" - it helps me understand things a lot better.

A friend and neighbor is a board member of the Museum of the Fur Trade in Nebraska. I will be bending his ear soon about AI and their documents.


Curious to know more about the ethics of putting archival docs into tools like ChatGPT, specifically about data-sharing and whether any protections or limits apply to the files themselves.
