I wrote a weird blog post draft about AI last week that I wasn't sure what to do with. Listening to the latest ATP in the car today inspired me to publish it. One thing that resonated with me: people have strong feelings about AI, but blog posts are a good way to explore what we think about it.
Johnson City Coffee Co. ☕️
Training C-3PO
Many of the hot takes about fair use for AI training are either “AI is stealing content” or “everything on the web is free”, but the discussions in between those extremes are more interesting. Let’s explore it with a thought experiment. This blog post isn't a declaration that I've figured it all out. It's just to get us thinking.
First, review how blogging and fair use have worked since the beginning of the web. Every day I read a bunch of news articles and blog posts. If I find something I want to write about, I’ll often link to it on my blog and quote a few sentences from it, adding my own comment. Everyone agrees this is a natural part of the web and a good way for things to work.
An example outside the web is CliffsNotes. Humans can read the novel 1984 and then write a summary book of it, with quotes from the original. This is fine. It also indirectly benefits the original publisher as CliffsNotes brings more attention to the novel, and some people pick up their own copy.
Now, imagine that C-3PO is real. C-3PO is fluent in six million forms of communication, and he has emotions and personality quirks, but otherwise he learns like the rest of us: through experience.
C-3PO could sit down with thousands of books and web sites and read through them. If we asked C-3PO questions about what he had read, and then used some of that in our own writing, that seems like fair use of that content. Humans can read and then use that knowledge to create future creative works, and so can C-3PO. If C-3PO read every day for years, 24 hours a day, gathering knowledge, that would be fine too.
Is that different than training an LLM? Yes, in at least two important ways:
- Speed. It would take a human or C-3PO a long time to read even a fraction of all the world’s information.
- Scale. Training a single robot is different than training thousands of AI instances all at once, so that when deployed every copy already has all knowledge.
Copyright law says nothing about the speed of consumption. It assumes that humans can only read and create so much, because the technology for AI and even computers was science fiction when the laws were written. Robots and AI can not only quickly consume information, they can also retain all of it, making it more likely that they infringe on a substantial part of an original work.
Maybe copyright law only applies to humans anyway? I don't know. When our C-3PO was reading books in the above example, I doubt anyone was shouting: “That’s illegal! Robots aren’t allowed to read!”
The reality is that something has fundamentally shifted with the breakthroughs in generative AI and possibly in the near future with Artificial General Intelligence. Our current laws are not good enough. There are gray areas because the laws were not designed for non-humans. But restricting basic tasks like reading or vision to only humans is nonsensical, especially if robots inch closer to actual sentience. (To be clear, we are not close to that, but for the first time I can imagine that it will be possible.)
John Siracusa explored some of this in a blog post earlier this year. On needing new laws:
Every new technology has required new laws to ensure that it becomes and remains a net good for society. It’s rare that we can successfully adapt existing laws to fully manage a new technology, especially one that has the power to radically alter the shape of an existing market like generative AI does.
Back to those two differences in LLM training: speed and scale.
If speed of training is the problem (that is, being able to effectively soak up all the world’s information in weeks or months), where do we draw the line? If it’s okay for an AI assistant to read slowly like C-3PO, but not okay to read quickly with thousands of bots in parallel, how do we even define what slow and quick are?
If scale is the problem — that is, being able to train a model on content and then apply that training to thousands or millions of exact replicas — what if scale is taken away? Is it okay to create a dumb LLM that knows very little, perhaps having only been trained on licensed content, and then create a personal assistant that can go out to the web and continue learning, where that training is not contributed back to any other models?
In other words, can my personal C-3PO (or, let’s say, my personal ChatGPT assistant) crawl the web on my behalf, so that it can get better at helping me solve problems? I think some limited on-demand crawling is fine, in the same way that opening a web page in Safari using reader mode without ads is fine. As Daniel Jalkut mentioned in our discussion of Perplexity on Core Intuition, HTTP uses the term user-agent for a reason. Software can interact with the web on behalf of users.
That is what is so incredible about the open web. While most content is under copyright by default, and some is licensed with Creative Commons or in the public domain, everything not behind a paywall is at least accessible. We can build tools that leverage that openness, like web browsers, search engines, and the Internet Archive. Along the way, we should be good web citizens, which means:
- Respecting robots.txt.
- Not hitting servers too hard when crawling.
- Identifying what our software is so that it can be blocked or handled in a special way by servers.
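As a rough sketch of what those three habits could look like in code, here's a small Python example using the standard library's urllib.robotparser. The bot name, its URL, and the helper names (polite_crawl, http_fetch) are made up for illustration, and it assumes every URL is on the same site as the robots.txt text passed in. It isn't how any particular crawler works, just one way to do it.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical bot identity, so servers can recognize and block it if they want.
USER_AGENT = "ExampleAssistantBot/0.1 (+https://example.com/bot)"


def http_fetch(url, headers):
    """Fetch a URL with the given headers and return the raw bytes."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()


def polite_crawl(urls, robots_txt, fetch=http_fetch, sleep=time.sleep,
                 default_delay=1.0):
    """Fetch only the URLs that robots.txt allows, pausing between requests.

    Assumes all URLs belong to the site that served robots_txt.
    """
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Use the site's Crawl-delay if it has one, otherwise our own default.
    delay = parser.crawl_delay(USER_AGENT)
    if delay is None:
        delay = default_delay

    # Respect robots.txt: drop anything the site has asked bots to skip.
    allowed = [url for url in urls if parser.can_fetch(USER_AGENT, url)]

    pages = {}
    for index, url in enumerate(allowed):
        if index > 0:
            sleep(delay)  # don't hit the server too hard
        pages[url] = fetch(url, {"User-Agent": USER_AGENT})
    return pages
```

In real use you would download the site's robots.txt first and pass its text in. The fetch and sleep arguments are only there so the logic can be tried out without touching the network.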
This can’t be stressed enough. AI companies should respect the conventions that have made the open web a special place. Respect and empower creators. And for creators, acknowledge that the world has changed. Resist burning everything down lest open web principles be caught in the fire.
Some web publishers are saying that generative AI is a threat to the open web. That we must lock down content so it can’t be used in LLM training. But locking content is also a risk to the open web, limiting legitimate crawling and useful tools that use open web data. Common Crawl, which some AI companies have used to bootstrap training, is an archive of web data going back to 2008, often used for research. If we make that dataset worse because of fear of LLMs misusing it, we also hurt new applications that have nothing to do with AI.
Finally, consider Google. If LLMs crawling the web is theft, why is Google crawling the web not theft? Google has historically been part of a healthy web because they link back to sites they index, driving new traffic from search. However, as Nilay Patel has been arguing with Google Zero, this traffic has been going away. Even without AI, Google has been attempting to answer more queries directly without linking to sources.
Google search and ChatGPT work differently, but they are based on the same access to web pages, so the solutions with crediting sources are intertwined. Neither should take more from the web than they give back.
This is at the root of why many creators are pushing back against AI. Using too much of an original work and not crediting it is plagiarism. If the largest LLMs are inherently plagiarism machines, it could help to refocus on smaller, personal LLMs that only gain knowledge at the user's direction.
There are also LLM use cases unrelated to derivative works, such as using AI to transcribe audio or describe what's in a photo. Training an LLM on sound and language so that it can transcribe audio has effectively no impact on the original creators of that content. How can it be theft if there are no victims?
I don’t have answers to these questions. But I love building software for the web. I love working on Micro.blog and making it easier for humans to blog. Generative AI is a tool I’ll use when it makes sense, and we should continue discussing how it should be trained and deployed, while preserving the openness that makes the web great.
Joe Biden in a written statement today:
Not the press, not the pundits, not the big donors, not any selected group of individuals, no matter how well intentioned. The voters - and the voters alone - decide the nominee of the Democratic Party. How can we stand for democracy in our nation if we ignore it in our own party? I cannot do that.
If Biden is having trouble communicating on television, maybe he should write more. Seriously. If we want a reality TV star for president, there's the other guy. 🇺🇸
Lunch at LBJ State Park. Picked up a turkey melt sandwich from a food truck in Johnson City, Cast Iron Punk.
Post-debate, post-interview questions
I have mixed feelings about where we go in the Democratic party. I think I'll be relieved if Biden steps aside because it resets everything about this campaign. The people voted for Kamala Harris too and she will be able to articulate the message against Trump more clearly.
On the other hand, Biden has been a very effective president. He never gets the credit he deserves, and this post-debate rollercoaster is no exception.
I was thinking about one line from his interview last Friday. Asked about Mark Warner assembling senators to convince Biden to drop out:
Mark is a good man. [...] He also tried to get the nomination too. Mark and I have a different perspective. I respect him.
Now imagine Trump being asked that question. Trump only cares about himself so he'll quickly attack any perceived disloyalty. Maybe that difference in respect is partly why Biden has been so effective with bipartisan legislation. He's been around a while. He's pragmatic.
After the interview, I watched John Fetterman answer questions about his support for Biden. When Fetterman makes up his mind about something, he sticks with it. He couldn't care less what you think and I sort of love that about him:
Donald Trump is back, and what do Democrats do? We panic and piss our pants. After a bad debate and after 34 convictions — felonies — the Republicans show up and they dress like him and go all-in on Trump.
Maybe we could learn something here and just say, "Stand by our president through this." After 50 years, and after almost four years as a great president, I think he’s entitled to make his case after a debate that we can all agree was rough. But I know what that’s like. I am not the sum total of a bad debate, and certainly the President isn’t either.
Last week I grew increasingly frustrated with the opinions section of The New York Times. It felt like half their home page was opinion, overshadowing the actual reporting. I cancelled my subscription. There are many places on the web to read opinions. More than ever, we need major news outlets to focus on reporting, not influencing. (I'm going with The Washington Post for a little while. Let's see how they do.)
People are worried that Biden might lose. Good, be worried. If more people were worried in 2016, Hillary would be wrapping up her 2nd term right now.
I'm starting to accelerate my challenge to visit all 88 state parks in Texas. Had a few slow months and realized it would take four years to finish at that pace. Way too long. I think two years is about right.
This political cartoon by Dave Whamond is so good. "C'mon man!"
Last post about Biden for a while. Dave Winer makes a point I've wanted to blog about:
- If Biden gets disabled, or dies, before or after the election -- VP Harris steps up.
- Now everyone can relax.
I'm not worried that Biden will die or have to resign halfway through his term. That's why we have a vice president. Kamala will finish the job, make history, then run for re-election. 🇺🇸
Small fan I keep in my laptop bag. Doesn’t completely solve Texas summer but helps a tiny bit. Need to find a USB-C version. ☀️
Exhausting day yesterday so I crashed early last night. Unfortunately a couple of our servers crashed too. 🙁 Looking into options to prevent this.
Here's a screenshot of the new Get Info window in the latest Mac app for Micro.blog, released today. Makes it possible to quickly grab the auto-generated text for uploaded photos on the Mac.
Ben Werdmuller, linking today to a story on Ghost federation:
I'm also convinced there's room for another fediverse-compatible social network that handles both long and short-form content in a similar way to Substack's articles and Notes. If someone else doesn't build that, I will.
Yeah, it's weird how no one has built this and certainly no one has been actively hosting thousands of blogs with long and short-form content, a social timeline, and fediverse integration for years. 🤪
Ben Thompson on Apple joining and then leaving the OpenAI board:
...joining the board of OpenAI emphasizes the narrative that Apple needs OpenAI, suggests that OpenAI is far more integral to Apple Intelligence than it actually is, and puts Apple in the firing line for any future OpenAI controversies. This was a bad idea, and the company is right to back out.
In the short term, it sounds like the OpenAI integration in iOS 18 will ship months before the full Siri + Apple Intelligence is ready. So you could say Apple does need them. Long term, there will be less and less need for a partner.
Sad to see the TUAW archive taken over by... whatever that is. AI-slopified. Brings back a lot of memories, though, going back through the Wayback Machine. The interview of me at WWDC 2007 seems lost to time, into the void of wherever Blip.tv content went. Also a tweet from that week.
Looking at my tweets from 17 years ago, they often had no faves, no replies, and no retweets (which didn't really exist yet). And it was fine. There was value even with very limited engagement, just as there is with blogging today.
Dave Winer replying to my post yesterday:
The way the twitter-alikes do discourse is not the only possible way, and imho, and, as I've said before (in 2007!), most of what passes for discourse on twitter is actually spam, and that goes for Masto, Threads, Bluesky and Facebook (aka FriendFeed).
I've long thought that platforms should be free to evolve in their own ways. Just like we don't need a monoculture of a single dominant platform, we don't need a single UX either.
Morning work at Cherrywood Coffeehouse. You know it's old-school Austin because I just saw what looked like a tiny baby scorpion crawling underneath the windowsill. Crept away, so hopefully I'm wrong or it doesn't stick around. ☕️
The latest Washington Post / ABC News poll has the race tied. Voters thought Biden was old before, and they still think that. Biden or Harris, the election will be decided by swing voters and misinformation. Millions don't know Trump was convicted, don't know about E. Jean Carroll, don't know about Roe. 🇺🇸
I think I'm going to love this USA basketball team. Nice move swapping in Derrick White. 🏀
Love these upcoming covers for A Song of Ice and Fire. I never finished reading past book 2 years ago and would like to pick up the series again.
Insightful post by Jason Fried on "directional" decisions, the kind of decisions that are strategic and bring a lot of other smaller decisions along with them:
Make a directional decision and you’re now pointed this way or that way. Make a directional decision and you either shut something off or open something up. Make a directional decision and you’ll get a hundred no’s for the price of one.
When a product seems rudderless, it's probably because there weren't any of these decisions.
I'm going to watch Biden's press conference in a couple hours. But I already know it won't change anyone's opinion unless Biden magically looks 10 years younger. Curious if reporters will try to ask any questions that aren't about how old he is. NATO summit this week was kind of big news.
Joe Biden has always rambled, and it hurts him as he gets older. But here’s the deal. One candidate sounds confident but everything he says is nonsense. One candidate sounds mumbling but everything he says is correct. C’mon man! 🇺🇸
Working on a couple different things, bouncing between them. One of those weeks where I can't make up my mind what's most important.
Here's a sneak peek at a reorganized theme editor, moving a few things around to fix long-standing usability problems. When it's ready (next week?) should be much easier to do quick edits without a bunch of clicks through managing themes.
Rust judge dismisses the case against Alec Baldwin after prosecutors withheld evidence, saying:
If this conduct does not rise to the level of bad faith, it certainly comes so near to bad faith as to show signs of scorching prejudice.
This case always seemed unnecessary to me. There’s gotta be more pressing issues to tackle.
Just posted a new Core Intuition: A Billion-Dollar Flop. We check in about the Vision Pro, then think about how Apple plays the long game and what we can learn about balancing marketing features vs. app polish.
Saying that Trump is a threat to democracy is not a call for violence. He caused January 6th. He refuses to accept the last election or the next one. He is chaos. Even after being shot, he’s shouting “fight” to the crowd. Everything he touches becomes more dangerous. Vote. 🇺🇸
Just watched Biden's statement on the shooting. Very good. He is the President and our Democratic nominee. To me, the debate about replacing him is over. The campaign is chaotic enough already. I still think he can win, but it's going to be close, and it's going to be work. 🇺🇸
I wish we could have a day or two each month where we all collectively agreed to post about things we enjoy, love, appreciate, or celebrate. No rage farming, click baiting, or rants allowed. Just for one day.
That would be nice! I might fail at this test, but it's a good goal.
My blog has probably shifted way too much to the political, but that's how it goes in an election year. Personal blog, personal opinions. But I've got some new Micro.blog improvements to talk about this week! Starting tomorrow.