User talk:Newyorkbrad#Clarification on date delinking injunction

Templates on my articles

Question for discussion

Many pages on Wikipedia, including drafts, deletion discussions, and the like, are not supposed to be indexed by search engines. My impression is that we frequently justify maintaining borderline material on the site, such as drafts and other discussions about living persons that would not be fit for mainspace, by pointing out that "it won't cause any harm because it won't show up in Google searches and the like." Indeed, in the past I have advocated increased usage of "NOINDEX" designations for this specific reason.

However, I have read in several places (a Wikipediocracy thread is the most recent) that AI bots that now routinely scrape Wikipedia often disregard robots.txt and similar designations and scrape every page. The contents of those pages, including those regarded as "not ready for prime time," are then impounded into AI databases and are at risk of being regurgitated as fact in later queries to the AI programs.

If this is the case, what implications does this have for our policies and operations?

I am interested in a broad discussion of this issue, but given my non-technical background, am asking here first to test whether my basic understandings/assumptions are correct, as well as whether this issue has been (or is being) discussed already. I would appreciate hearing from anyone with information on this. Thank you, Newyorkbrad (talk) 15:15, 2 April 2025 (UTC)
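
As a rough illustration of why robots.txt offers limited protection: the file is advisory, and honoring it is a check that the crawler itself chooses to run. The Python sketch below uses the standard-library `urllib.robotparser`. The rules, the `example.org` URLs, the `ExampleBot` user agent, and the `is_allowed` helper are all hypothetical and are not Wikipedia's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not any real site's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Disallow: /deletion-discussions/
"""


def is_allowed(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the hypothetical rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.org/drafts/some-page",
        "https://example.org/articles/some-page",
    ):
        # A well-behaved crawler skips URLs where this is False.
        # Nothing on the server side enforces that choice.
        print(url, is_allowed(url))
```

A crawler that never calls a check like `is_allowed` simply requests the page anyway, which is the behavior described above.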

I am a professional in the AI space, read books on this, and go to conferences. I am on the social side of AI and not the technical side.
I have a general recommendation for comprehending AI trends, which I think has made accurate and actionable predictions to this point. When predicting trends in a well-funded tech direction, do not worry about the AI having bias, making errors, misunderstanding anything, or having shortcomings in the future which are visible as problems today. Instead, your worry should be about a near future when the AI does everything with transcendent perfection. Applied to this case, the concern should be of the AI understanding the exact extent of how NOINDEX pages are different from other Wikimedia content, as well as understanding all other available data. Bluerasberry (talk) 15:40, 2 April 2025 (UTC)
Thanks for the reply. I think the concern is that some AI programs and programmers may not care about how or why NOINDEX or robots.txt pages are different from other Wikipedia content, but may just be treating pages as data indiscriminately. Newyorkbrad (talk) 15:52, 2 April 2025 (UTC)
Hi! Someone alerted me to this because I was puzzling over why we have, essentially, two things that do the same thing (draftspace and userspace), but one seems to waste more editor-hours on busywork. They suggested I put my thoughts down, which I sketched out sorta here User:Tduk/Draftspace. Anyway in a nutshell, if this might be a motivating factor to reform draftspace and actually make it useful to new editors, I’d support it! In response to this direct question, I’d wonder what the impact is on AI of making user/draftspace articles vs just making public web pages. Tduk (talk) 19:49, 2 April 2025 (UTC)
"When predicting trends in a well-funded tech direction, do not worry about the AI having bias, making errors, misunderstanding anything, or having shortcomings in the future which are visible as problems today." Whether or not we are concerned that LLMs will continue to have issues with bias/hallucination in the future, given that we know that current iterations of the technology do have those issues (and that many people seem to give text generated by LLMs significantly more trust than I think the evidence suggests that it currently deserves) we should be concerned about the effects that LLM use of draftspace etc. is having now. When Google was putting AI-generated results at the top of all their searches I regularly found that it gave information which I knew for a fact was wrong.
Personally I think that current evidence does not suggest that anything approaching "an AI which does everything with transcendent perfection" is coming in the near future, but even if it is, we still have an indeterminate amount of time now where the nearest thing we have to general AI is LLMs, which are not close to transcendent perfection. Caeciliusinhorto-public (talk) 08:32, 3 April 2025 (UTC)
"[W]hat implications does this have for our policies and operations?" I'd have to suggest that perhaps one of the most obvious ones is that, both from an ethical perspective and possibly a legal one, Wikipedia needs to consider enforcing the 'other pages, including talk pages' provisions within existing WP:BLP policy more strictly. If the bullshit-bots are scraping such content and potentially regurgitating it, and we are aware of the fact, we can't just pretend it isn't happening. AndyTheGrump (talk) 21:26, 2 April 2025 (UTC)
This makes sense - I'd worry also if people started to become aware of when AI bots did their scraping - they could sneak in vandalism right before the scrape, so that even if it was reverted pretty quickly, it may get in. It seems like in all the internet wars so far involving scraping vs fake content, the fake content has been winning. Tduk (talk) 21:34, 2 April 2025 (UTC)
It's a shame that the original report came in a Diff post and had a focus on Infrastructure, as the implications probably need supplementary information on crawler workload, to draw out implications on, for example:
  • Draft-space content being consumed for regurgitation;
  • Talk-page and User-Talk-page content being consumed, whether to better simulate human language or to regurgitate discussion points/claims as facts.
Maybe requests for non-mainspace content could be throttled to one per 10s - wouldn't impact a real user but would hinder a crawler?
And outside en.wiki, this changing workload may increase the case for stricter validation of content being added to small-language wikis lacking many-eyes oversight, such as the cases of the Scots wiki and the Greenlandic wiki, to avoid unidentified poor translation-tool text being consumed by crawlers and then becoming part of the language. AllyD (talk) 08:37, 5 April 2025 (UTC)
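
For anyone wanting to picture the throttling idea above, here is a minimal Python sketch of a per-client limit of one request per 10 seconds. The class name `NonMainspaceThrottle`, the `allow` method, and the use of a plain client identifier are all hypothetical; this is not how MediaWiki or the WMF's infrastructure handles rate limiting, only an illustration of the mechanism.

```python
class NonMainspaceThrottle:
    """Hypothetical limiter: at most one accepted request per client per interval."""

    def __init__(self, min_interval: float = 10.0) -> None:
        self.min_interval = min_interval
        # Maps a client identifier to the time of its last accepted request.
        self._last_accepted = {}

    def allow(self, client_id: str, now: float) -> bool:
        """Return True if the request at time `now` (seconds) should be served."""
        last = self._last_accepted.get(client_id)
        if last is not None and now - last < self.min_interval:
            # Too soon after the previous accepted request: reject.
            return False
        self._last_accepted[client_id] = now
        return True


if __name__ == "__main__":
    throttle = NonMainspaceThrottle()
    for t in (0.0, 3.0, 10.0, 12.0):
        print(t, throttle.allow("crawler-1", t))
```

Passing the timestamp in explicitly keeps the sketch easy to reason about. Any real deployment would also need to decide how a "client" is identified, which is an assumption left open here.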
I'd strongly challenge "Maybe requests for non-mainspace content could be throttled to one per 10s - wouldn't impact a real user": skimming between 100+ near-identical shots on Commons of the same building to try to decide which one best illustrates a particular architectural element (for example) isn't at all an unusual situation.
(Personally, if pressed I'd guess this is an issue which will resolve itself fairly soon. The AI bubble is almost certainly reaching its bursting point, and whichever two or three systems survive the bust will presumably soon have completed their bulk downloads and will just periodically check for new additions from then on. We went through the same thing 20 years ago when every AltaVista, HotBot and AskJeeves was constantly downloading text dumps, and we survived without any obvious problems. For reasons NYB knows well, I have a high degree of scepticism whenever the WMF comes out with any claimed problem to which the answer is "we need more money".)
My broader thoughts on the AI scraping issue are here, to avoid cluttering NYB's talkpage with what's essentially a lengthy rambling aside. ‑ Iridescent 16:35, 6 April 2025 (UTC)

I apologize for coming to this discussion late, but I believe that people might be interested in a little experiment I performed with ChatGPT.

I've been struggling with writing an article about a surprisingly overlooked battle of WWII. The reasons I've had problems writing it are not relevant, although I have the article outlined. However, I thought one way to overcome these problems would be to use ChatGPT to turn my outline into an article, so I fed the AI program part of the outline to see what kind of article it would produce. (At least I would not have to worry about the software hallucinating facts & sources.) The result was... underwhelming. ChatGPT's article was full of cringe. (That's been the result of a lot of my experiments with using ChatGPT to write fiction.) Despite providing all of the needed facts, ChatGPT introduced a number of errors -- e.g., ChatGPT assumed that "86th Mountain Infantry" was a division & not a regiment, & labeled it as a division. Lastly, although I had provided ample references, ChatGPT decided to omit all of them. I ended up having to drastically rewrite the AI output to make it acceptable for Wikipedia.

In short, I would be very wary of any AI-generated content. The content would be so unreliable that it might be simpler, if the article was on a notable subject, to simply replace it with a stub -- or delete it.

Then again, this technology is evolving so fast that another AI could take an outline & produce a usable Wikipedia article. -- llywrch (talk) 22:13, 12 May 2025 (UTC)

Take a look at Shit flow diagram and Malacca dilemma. The results aren't stellar, but it can be used functionally. There's still a significant time investment, because I read the sources and checked every statement while converting references to sfn. It can't be trusted to work without oversight, but it's getting better. ScottishFinnishRadish (talk) 22:33, 12 May 2025 (UTC)
I just became aware that there is: Wikipedia:WikiProject AI Cleanup. --Tryptofish (talk) 22:17, 22 May 2025 (UTC)

June 11: Virtual NYC WikiWednesday

June 11: WikiWednesday Salon (Virtual)

You are invited to join the Wikimedia NYC community for our virtual WikiWednesday Salon. This month's WikiWednesday will be fully online and focused on Wikimedia global trends, neutral point of view, and the Wiki Loves Pride campaign for Pride Month. No experience of anything at all is required. All are welcome!

Meeting info:

All attendees at Wikimedia NYC events are subject to the Wikimedia NYC Code of Conduct and Photography Policy.

(You can subscribe/unsubscribe from future notifications for NYC-area events by adding or removing your name from this list.)

--Wikimedia New York City Team via MediaWiki message delivery (talk) 14:05, 5 June 2025 (UTC)

Your ARC comment

I expected better from you. Surely you meant "In case this is accepted ..."? RoySmith (talk) 15:03, 2 July 2025 (UTC)

The odds of any more good puns here are minuscule. Regards, Newyorkbrad (talk) 15:38, 2 July 2025 (UTC)
Be bold, and you'll find some. --SarekOfVulcan (talk) 15:41, 2 July 2025 (UTC)

Regarding your closing comment on the Wikipedia:What’s a forint? MfD page

@Newyorkbrad I was looking on the MfD page for my silly and absurd essay, and I saw you wrote as if I were a female (as in “she” and “her” pronouns being used). I am actually a male, and I would’ve preferred it if you either used he/him or even just generally used they/it. I’m not mad with you, and I can see where you got “she” and “her” from (my username). Melissza’s page Have a talk! My contributions 18:37, 8 July 2025 (UTC)

Also, I don’t really plan on recovering the page. I understand your privacy concerns, and in my opinion, this “humorous” essay was terrible (I mean, five-sixths of it was straight up copy-editing the page for the Hungarian forint and plain waffle) and I would like to just forget this essay until I feel comfortable enough to return to it. Melissza’s page Have a talk! My contributions 18:45, 8 July 2025 (UTC)
@Melissza1692: Apologies for the error with the pronouns. I will go to the closing statement now and fix them. Thank you for understanding the reasons for my closing, and good luck with your future editing. Newyorkbrad (talk) 20:57, 8 July 2025 (UTC)

WikiNYC this week: Thursday Edit-a-thon + Sunday Wiki-Picnic!

Two Wikimedia NYC 400 events

Please join us for the launch events to recognize the Wikimedia NYC 400!

Fulfilling Wikimedia NYC tradition, we'll start off the campaign with an edit-a-thon on Thursday and a Wiknic on Sunday, and will continue with Wikimedia NYC 400 events throughout the rest of this year.

Event details and registration link below:

Edit-a-thon (Thursday)
When: July 17, 2025
Time: 5pm to 8pm (8pm to 10pm optional hackathon)
Where: In person at Prime Produce (424 W 54th St, New York, NY 10019)

Wiknic (Sunday)
When: July 20, 2025
Time: 12pm to 4pm
Where: In person in NYC at Washington Square Park

All attendees at Wikimedia NYC events are subject to the Wikimedia NYC Code of Conduct and Photography Policy.

(You can subscribe/unsubscribe from future notifications for NYC-area events by adding or removing your name from this list.)

--Wikimedia New York City Team via MediaWiki message delivery (talk) 02:53, 14 July 2025 (UTC)