This text-generation algorithm is supposedly so good it’s frightening. Judge for yourself.

If you’ve ever wondered what “Nieman Lab written by robots” would look like, this is the piece for you.

By Joshua Benton @jbenton Nov. 7, 2019, 11:12 a.m.

The best weapons are secret weapons. Freed from the boundaries of observable reality, they can hold infinite power and thus provoke infinite fear — or hope. In World War II, as reality turned against them, the Nazis kept telling Germans about the Wunderwaffe about to hit the front lines — “miracle weapons” that would guarantee victory for the Reich. The Stealth Bomber’s stealth was not just about being invisible to radar — it was also about its capabilities being mysterious to the Soviets. And whatever the Russian “dome of light” weapon is and those Cuban “sonic attacks” are, they’re all terrifying.

So whether intentionally or not, the creators of the text-generating algorithm GPT-2 played the PR game brilliantly in February when they announced that, well, it just may be too powerful to release to the general public. That generated a wave of global publicity that is, shall we say, uncommon for new text-generating algorithms. (Elon Musk is involved, you’ll be shocked to learn.)

In any event, now, nine months later, the folks at OpenAI have apparently decided that the infopocalypse is not right around the corner and released their secret superweapon GPT-2 into the wild. They say they have “seen no strong evidence of misuse so far” from more limited releases of the technology.

We're releasing the 1.5billion parameter GPT-2 model as part of our staged release publication strategy.
– GPT-2 output detection model: https://t.co/PX3tbOOOTy
– Research from partners on potential malicious uses: https://t.co/om28yMULL5
– More details: https://t.co/d2JzaENiks pic.twitter.com/O3k28rrE5l

— OpenAI (@OpenAI) November 5, 2019

The alleged threat is not, as some journalists have feared, that this machine is going to eventually cover city council meetings and aggregate viral tweets more efficiently than a human reporter could. It’s that the ease of generating semi-convincing output could make it efficient to pump out thousands or millions of individually tailored pieces of misinformation, spam emails, or some other text-based form of grift.

I suppose that’s something legitimate to worry about, but my experience playing around with GPT-2 has mostly matched this guy’s: Creating readable prose, sensical prose, and believable prose are still pretty different things.

The full GPT-2 model is finally out.

I dunno, guys. pic.twitter.com/k5dP8V2sa5

— Tomer Ullman (@TomerUllman) November 5, 2019

That something sounds/reads like it was generated by GPT-2 is an interesting new kind of an insult.

— Andrej Karpathy (@karpathy) November 6, 2019

To test out its capabilities, I thought I’d feed it the ledes of a few Nieman Lab stories and see what it autogenerated from there. For each of these below, the actual human-written text is in italics; everything after that is “by” GPT-2. (Note that GPT-2, as used on Talk to Transformer, often ends its output mid-sentence.)

🤖 🤖 🤖

Original story: “Meet The Salt Lake Tribune, 501(c)(3): The IRS has granted nonprofit status to a daily newspaper for the first time,” by Christine Schmidt, published Nov. 4.

It was a “happy surprise,” Fraser Nelson said, when The Salt Lake Tribune received a letter from the IRS on Friday giving the 148-year-old news outlet nonprofit 501(c)(3) status — no questions asked.

The Tribune, like all nonprofit news organizations in Utah, is also required by IRS code to register with the agency under the section 501(c)(4) that allows “social welfare” nonprofits like the Tribune to accept donations, make grants and engage in the business of journalism.

The Tribune’s board, however, approved the decision to not register under 501(c)(3), and instead register under 501(c)(4), a section that provides tax-exempt status to religious organizations.

“We felt that our reporting is political in nature and not editorial, so we felt the tax exemption was needed to protect the integrity of our journalism,” Nelson said.

According to IRS code, 501(c)(4) organizations can engage in political, educational or religious activities, but not engage in political campaign activity or engage

Verdict: This is in the broad ballpark of news prose style. It’s interesting it chose to contradict the lede so quickly — specifically saying it rejected 501(c)(3) status. The real Fraser Nelson wouldn’t like that quote, I imagine. Call it a B+.

🤖 🤖 🤖

Original story: “Watch your language: ‘Data voids’ on the web have opened a door to manipulators and other disinformation pushers,” by me, published Nov. 1.

One day fifteen long years ago, in 2004, some SEO consultants decided to have a contest to determine quién es más macho in the gaming-search-results game. As they put it: “Are you a Player or a Stayer?”

At the time, Google and Bing were using a variety of techniques to determine whether search results for a given query were “Google-friendly.”

The main criterion was the number of Web pages containing the search string.

In the contest, the contest participants decided to make a bolder claim: “No, you’re not a player.”

They set a goal: a “threshold.”

The goal: to make the Google ranking algorithm think that a page was being searched for in the way you might expect it to be.

They wanted the Google algorithm to think the page was being used, and not just viewed.

The results of their contest were remarkable.

The

Verdict: Choppy short sentences without much flow or sense, despite sticking in the general topic area. I was sorta hoping it would throw in some more Spanish. I’ll give it a C-.

🤖 🤖 🤖

Original story: “The Daily Wire and black salve show that Facebook’s takedown policies have one thing in common: Inconsistency,” by Laura Hazard Owen, published Nov. 1.

Inauthentic coordinated behavior, in the U.S.? One of modern media’s mysteries the past few years has been: How does the right-wing website The Daily Wire do so incredibly well on Facebook?

The site is so successful that its co-founder, Ben Shapiro, now regularly speaks to thousands of college students in Washington, D.C. and, most recently, at the Heritage Foundation. Yet, the site has been criticized for its lack of coverage of domestic issues. The problem is that the right-wing site has a left-leaning bias: its stories often criticize left-wing figures such as President Barack Obama.

In an article for The Wall Street Journal, conservative writer Ben Shapiro said that Facebook users are “left-leaning” — and, thus, the site should have more of a bias toward the right.

The Daily Wire, a right-wing news site, is the most-shared conservative site on Facebook and has more than 30 million users. Its Facebook page has more than 11 million likes and

Verdict: Pretty realistic at the sentence level, and maybe the self-contradicting “The problem is that the right-wing site has a left-leaning bias” is actually a galaxy-brain commentary on our current politics. A solid B.

🤖 🤖 🤖

Original story: “Why I’m starting a company to build out a new model for local news in the 2020s,” by Ken Doctor, published Oct. 25.

Over the past decade here at Nieman Lab, I’ve reported a lot of news industry news. Today, I’m sharing some of my own. This morning, I’m posting the latest story on the news industry.

“As the story behind today’s story is well known, the National Security Agency was listening to telephone calls made in Germany by the German Chancellor Angela Merkel. This story comes from an important source: German journalist Ulf Laessing. A few days ago, he told me that a former U.S. intelligence officer had revealed this to him. He gave me the name of the source, but the American intelligence officer has not been seen publicly since.”

– Edward Snowden, 2013 interview with Glenn Greenwald,

the Guardian

“The NSA has been listening to the Chancellor of the German Federal Republic [of Germany] Angela Merkel’s phone calls since 2008. They also spied on the phones of other senior leaders of the G20, including

Verdict: Um, no. There’s no mention of Germany in Ken’s piece and I can’t quite figure out what would make the algorithm want to jump straight to Snowden and Merkel. The fractured lines would make this easy to spot. F.

🤖 🤖 🤖

And finally, to go full ouroboros, let’s feed the lede of this story right here into GPT-2.

Original story: This one, silly

The best weapons are secret weapons. Freed from the boundaries of observable reality, they can hold infinite power and thus provoke infinite fear — or hope. In World War II, as reality turned against them, the Nazis kept telling Germans about the Wunderwaffe about to hit the front lines — “miracle weapons” that would guarantee victory for the Reich. When the Germans couldn’t take it, they bombed their own cities, as well as a small number of Allied cities.

Today, there are many secret weapons in use around the globe — weapons that, if deployed, would have immense consequences. A weapon that can cause such destruction and pain would be a threat not just to its users, but to humanity as a whole.

The Pentagon has repeatedly refused to acknowledge the existence of secret weapons — weapons it would be a crime for the world’s media to disclose to the public. In the past, these weapons have been used to wage undeclared wars, including those in Cambodia, Laos, and Afghanistan. These weapons can kill more innocent civilians than nuclear weapons,

Verdict: I’m sure it was right about to pivot back to text-generation algorithms in a couple more grafs. But this is a very reasonable continuation of the lede (other than that first sentence). B.

🤖 🤖 🤖

GPT-2 is not coming to take the jobs of journalists, as some have worried. Paid reporting jobs generally require a certain level of factuality that the algorithm can’t match.

Is it coming for the “jobs” of fake-news writers, those Macedonian teens who until now have had to generate their propaganda (gasp!) by hand? Probably not. Whether your intention is to make money off ad arbitrage or to elect Donald Trump as president of the United States, the key value-add comes in knowing how to exploit a reader’s emotions, biases, preconceptions, and other lizard-brain qualities that can make a lie really hit home. Baiting that hook remains something an algorithm can’t reliably do. And it’s not as if “lack of realistic writing in grafs 3 through 12” was a real problem limiting most misinformation campaigns.

But I can see some more realistic impacts here. This quality of generated text could allow you to create a website with what appear to be fully fleshed out archives — pages and pages of cogent text going back years — which might make it seem more legitimate than something more obviously thrown together.

GPT-2’s relative mastery of English could give foreign disinformation campaigns a more authentic-sounding voice than whatever the B-team at the Internet Research Agency can produce from watching Parks & Rec reruns.

And the key talent of just about any algorithm is scale — the ability to do something in mass quantities that no team of humans could achieve. As Larry Lessig wrote in 2009 (and Philip Bump reminded us of this week), there’s something about a massive data dump that especially encourages the cherry-picking of facts (“facts”) to support one’s own narrative. Here’s Bump:

In October 2009, he wrote an essay for the New Republic called “Against Transparency,” a provocative title for an insightful assessment of what the Internet would yield. Lessig’s argument was that releasing massive amounts of information onto the Internet for anyone to peruse — a big cache of text messages, for example — would allow people to pick out things that reinforced their own biases…

Lessig’s thesis is summarized in two sentences. “The ‘naked transparency movement’…is not going to inspire change,” he wrote. “It will simply push any faith in our political system over the cliff”…

That power was revealed fully in the 2016 election by one of the targets of the Russia probe: WikiLeaks. The group obtained information stolen by Russian hackers from the Democratic National Committee and Hillary Clinton’s campaign chairman, John Podesta…In October, WikiLeaks slowly released emails from Podesta…Each day’s releases spawned the same cycle over and over. Journalists picked through what had come out, with novelty often trumping newsworthiness in what was immediately shared over social media. Activists did the same surveys, seizing on suggestive (if ultimately meaningless) items. They then often pressured the media to cover the stories, and were occasionally successful…

People’s “responses to information are inseparable from their interests, desires, resources, cognitive capacities, and social contexts,” Lessig wrote, quoting from a book called “Full Disclosure.” “Owing to these and other factors, people may ignore information, or misunderstand it, or misuse it.”

If you wanted to create something as massive as a fake cache of hacked emails, GPT-2 would be of legitimate help — at least as a starting point, producing something that could then be fine-tuned by humans.

The key fact of the Internet is that there’s so much of it. Too much of it for anyone to have a coherent view. If democracy requires a shared set of facts — facts traditionally supplied by professional journalists — the ability to flood the zone with alternative facts could take the bot infestation of Twitter and push it out to the broader world.

Illustration by Zypsy ✪ used under a Creative Commons license.

Joshua Benton is the senior writer and former director of Nieman Lab. You can reach him via email (joshua_benton@harvard.edu) or Twitter DM (@jbenton).

