~www_lesswrong_com | Bookmarks (713)

New AI safety treaty paper out! — LessWrong

lesswrong.com

Published on March 26, 2025 9:29 AM GMT
Last year, we (the Existential Risk Observatory) published a Time Ideas piece proposing the Conditional AI Safety Treaty, a proposal to pause AI when AI safety institutes determine that its risks, including loss of control, have become unacceptable. Today, we publish our paper on the topic: “International Agreements on AI Safety: Review and Recommendations for a Conditional AI Safety...
1
Map of all 40 copyright suits v. AI in U.S. — LessWrong

lesswrong.com

Published on March 26, 2025 7:57 AM GMT
Download the latest PDF with links to court dockets here.
1
AI "Deep Research" Tools Reviewed — LessWrong

lesswrong.com

Published on March 24, 2025 6:40 PM GMT
Midjourney: “an artificially intelligent researcher, library, posthuman archivist, mapping the noosphere”
As regular readers are aware, I do a lot of informal lit review. So I was especially interested in checking out the various AI-based “deep research” tools and seeing how they compare. I did a side-by-side comparison, using the same prompt, of Perplexity Deep Research, Gemini...
2
Notes on countermeasures for exploration hacking (aka sandbagging) — LessWrong

lesswrong.com

Published on March 24, 2025 6:39 PM GMT
If we naively apply RL to a scheming AI, the AI may be able to systematically get low reward/performance while simultaneously not having this behavior trained out because it intentionally never explores into better behavior. As in, it intentionally puts very low probability on (some) actions which would perform very well to prevent these actions from being...
1
Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? — LessWrong

lesswrong.com

Published on March 24, 2025 5:55 PM GMT
We recently released Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?, a major update to our previous paper/blogpost, evaluating a broader range of models (e.g. helpful-only Claude 3.5 Sonnet) in more diverse and realistic settings (e.g. untrusted monitoring).
Abstract
An AI control protocol is a plan for usefully deploying AI systems that prevents an AI from intentionally...
1
From Loops to Klein Bottles: Uncovering Hidden Topology in High Dimensional Data — LessWrong

lesswrong.com

Published on March 24, 2025 5:09 PM GMT
Motivation
Dimensionality reduction is vital to the analysis of high dimensional data, i.e. data with many features. It allows for better understanding of the data, so that one can formulate useful analyses. Dimensionality reduction that produces a set of points in a vector space of dimension n...
1
Will Jesus Christ return in an election year? — LessWrong

lesswrong.com

Published on March 24, 2025 4:50 PM GMT
Thanks to Jesse Richardson for discussion.
Polymarket asks: will Jesus Christ return in 2025?
In the three days since the market opened, traders have wagered over $100,000 on this question. The market traded as high as 5%, and is now stably trading at 3%. Right now, if you wanted to, you could place a bet that Jesus Christ will...
1
Sentinel's Global Risks Weekly Roundup #12/2025: Famine in Gaza, H7N9 outbreak, US geopolitical leadership weakening. — LessWrong

lesswrong.com

Published on March 24, 2025 4:46 PM GMT
Executive summary
Forecasters believe there’s an 18% chance (range: 4%-50%) that there will be a famine in any part of Gaza by the end of 2025, according to the UN and its Integrated Food Security Phase Classification (IPC). A Category 5 rating would result in a positive resolution, with the last IPC update suggesting that all of Gaza...
1
Delicious Boy Slop - Boring Diet, Effortless Weightloss — LessWrong

lesswrong.com

Published on March 24, 2025 3:01 PM GMT
Your beloved 34 year old author is never hungry
I often joke I’m the only traditional rationalist left. The original pitch was that you could radically improve your life by being more strategic. Huge piles of expected value were available. Everyone else seems to have given up, but I’m still a believer. For example, in 2017 Scott Alexander...
1
More on Various AI Action Plans — LessWrong

lesswrong.com

Published on March 24, 2025 1:10 PM GMT
Last week I covered Anthropic’s relatively strong submission, and OpenAI’s toxic submission. This week I cover several other submissions, and do some follow-up on OpenAI’s entry.
Google Also Has Suggestions
The most prominent remaining lab is Google. Google focuses on AI’s upside. The vibes aren’t great, but they’re not toxic. The key asks for their ‘pro-innovation’ approach...
1
Emergent scaling effects on the functional hierarchies within LLMs — LessWrong

lesswrong.com

Published on March 24, 2025 1:03 PM GMT
I have been poking around with LLMs, and I found some results that seem broadly interesting.
Summary
Introduction: Large language models (LLMs) are usually structured as repeated transformer layers of the same size. However, this architecture is often described as functionally hierarchical, with earlier layers focusing on small patches of text while later layers parse document-wide information. I revisited...
1
Recommender Alignment for Lock-In Risk — LessWrong

lesswrong.com

Published on March 24, 2025 12:56 PM GMT
Epistemic status: my own research and reasoning about lock-in risk threat models, and how recommender systems connect to the threat model outlined. I'm fairly confident in the claims about the contribution of recommender systems to filter bubbles, less so on extreme and persuasive content selection effects.
TL;DR
We believe lock-in risks are a pressing problem, and that algorithmic technologies...
1
What's the word for the amount of expertise that I, an experienced therapy patient and generally educated person, have on psychology topics? — LessWrong

lesswrong.com

Published on March 23, 2025 5:38 PM GMT
Epistemic status: raising a question that I've found difficult
This topic has frustrated me some, and I think there are a variety of forces pointing in different directions.
Maximally conservative approach
"If you're not focused, I mean I can share what works for me but really there's a variety of mental illnesses that can cause lack of focus. I don't...
1
Probability Theory Fundamentals 102: Source of the Sample Space — LessWrong

lesswrong.com

Published on March 23, 2025 5:23 PM GMT
The usual explanation of probability theory goes like this:
There is this thing called Probability Space, which consists of three other things:
Sample Space - some non-empty set
Event Space - a set of subsets of the Sample Space
Probability Function - a measure function over the elements of the Event Space.
And then several examples of how we can merge this...
1
How to mitigate sandbagging — LessWrong

lesswrong.com

Published on March 23, 2025 5:19 PM GMT
Epistemic status: I have worked on sandbagging for ~1 year. I expect to be wrong in multiple ways, but I do think this post provides both a useful high-level model and a good place to discuss how to mitigate sandbagging. Better conceptual approaches probably exist, e.g., selecting different main factors.[1]
TL;DR: Fine-tuning access, data quality, and scorability are...
1
Solving willpower seems easier than solving aging — LessWrong

lesswrong.com

Published on March 23, 2025 3:25 PM GMT
I'm awake about 17 hours a day. Of those I'm being productive maybe 10 hours a day.
My working definition of productive is in the direction of: "things that I expect I will be glad I did once I've done them"[1].
Things that I personally find productive include
Chores
Work
Eating
Cooking
Reading a good book
Watching TV with my Wife/Kids
Playing with the kids
Socialising with...
1
Privateers Reborn: Cyber Letters of Marque — LessWrong

lesswrong.com

Published on March 23, 2025 3:39 AM GMT
For too long the United States has suffered from state-sponsored or state-enabled cybercriminals, while preventing our security professionals from fighting back.
The US should revitalize privateering for the digital age, and there is constitutional support for the practice. In this more academic paper, I dive into the history of letters of marque and how we can...
1
Tied Crosscoders: Explaining Chat Behavior from Base Model — LessWrong

lesswrong.com

Published on March 22, 2025 6:07 PM GMT
Abstract
We are interested in model-diffing: finding what is new in the chat model when compared to the base model. One way of doing this is training a crosscoder, which would just mean training an SAE on the concatenation of the activations in a given layer of the base and chat model. When training this crosscoder, we find...
2
Reframing AI Safety as a Neverending Institutional Challenge — LessWrong

lesswrong.com

Published on March 23, 2025 12:13 AM GMT
Crossposted from https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/
Stephen Casper
“They are wrong who think that politics is like an ocean voyage or a military campaign, something to be done with some particular end in view, something which leaves off as soon as that end is reached. It is not a public chore, to be got over with. It is a way of life.”–...
1
The Dangerous Illusion of AI Deterrence: Why MAIM Isn’t Rational — LessWrong

lesswrong.com

Published on March 22, 2025 10:55 PM GMT
Executive Summary
Mutual Assured AI Malfunction (MAIM), a strategic deterrence framework proposed to prevent nations from developing Artificial Superintelligence (ASI), is fundamentally unstable and dangerously unrealistic. Unlike Cold War-era MAD, MAIM involves multiple competing actors, increasing risks of unintended escalation, misinterpretation, and catastrophic conflict. Furthermore, ASI itself, uncontainable by design, would undermine any structured deterrent equilibrium. Thus, pursuing MAIM to...
1
Transhumanism and AI: Toward Prosperity or Extinction? — LessWrong

lesswrong.com

Published on March 22, 2025 6:16 PM GMT
This article explores the multiple transhumanist views on AI: a promise of emancipation for some, an existential threat for others. Between enthusiasm, caution, and controversy, it sheds light on those who think about the future.
Transhumanists: Blind Tech Enthusiasts?
November 30, 2022, marked a turning point. On that day, OpenAI unveiled ChatGPT. Since then, artificial intelligence has received unprecedented...
2
[Replication] Crosscoder-based Stage-Wise Model Diffing — LessWrong

lesswrong.com

Published on March 22, 2025 6:35 PM GMT
Introduction
Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs we worked with single-model all-layer crosscoders, and found that the technique is also effective with cross-layer features.
This...
1
How I force LLMs to generate correct code — LessWrong

lesswrong.com

Published on March 21, 2025 2:40 PM GMT
In my daily work as a software consultant I'm often dealing with large pre-existing code bases. I use GitHub Copilot a lot. It's now basically indispensable, but I use it mostly for generating boilerplate code, or figuring out how to use a third-party library.
As the code gets more logically nested though, Copilot crumbles under the weight of complexity....
1
Prospects for Alignment Automation: Interpretability Case Study — LessWrong

lesswrong.com

Published on March 21, 2025 2:05 PM GMT
For human-level AI (HLAI) we will need robust control or alignment methods. Assuming short timelines to HLAI, the tractability of automating safety research becomes central. In this post, I will make the case that safety-relevant progress on automated interpretability R&D is likely; however, naive interpretability automation may only be usable on the subset of safety problems having...
1
