~www_lesswrong_com | Bookmarks (714)

Frontier Models are Capable of In-context Scheming — LessWrong

lesswrong.com

Published on December 5, 2024 10:11 PM GMT
This is a brief summary of what we believe to be the most important takeaways from our new paper and from our findings shown in the o1 system card. We also specifically clarify what we think we did NOT show.
Paper: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
Twitter about paper: https://x.com/apolloaisafety/status/1864735819207995716
Twitter about o1 system card: https://x.com/apolloaisafety/status/1864737158226928124
What we think the most important findings are
Models are now capable enough...
Expevolu, a laissez-faire approach to country creation — LessWrong

lesswrong.com

Published on December 5, 2024 7:29 PM GMT
I write this post to present expevolu[1], a system to enable people to establish new independent countries peacefully, through the legal acquisition of territorial rights via trade. This is the first post in a three-part series introducing the idea. This post, part I, is dedicated to explaining the basics of the system. Part II will deal mostly with...
Should you be worried about H5N1? — LessWrong

lesswrong.com

Published on December 5, 2024 9:11 PM GMT
Epistemic status: a few people without any particular expertise in epidemiology spent an afternoon in a coffee shop discussing and reading about H5N1, with a focus on how an individual should orient towards this (as opposed to, say, the government). This is a write-up of what I took away from that exercise, written from my perspective. Some ideas...
Are SAE features from the Base Model still meaningful to LLaVA? — LessWrong

lesswrong.com

Published on December 5, 2024 7:24 PM GMT
Shan Chen, Jack Gallifant, Kuleen Sasse, Danielle Bitterman[1]
Please read this as a work in progress where we are colleagues sharing this in a lab (https://www.bittermanlab.org) meeting to help/motivate potential parallel research.
TL;DR:
Recent work has evaluated the generalizability of Sparse Autoencoder (SAE) features; this study examines their effectiveness in multimodal settings. We evaluate feature extraction using a CIFAR-100-inspired explainable classification...
Are SAE features from the Base Model still meaningful to LLaVA? — LessWrong

lesswrong.com

Published on December 5, 2024 8:21 PM GMT
Shan Chen, Jack Gallifant, Kuleen Sasse, Danielle Bitterman[1]
Please read this as a work in progress where we are colleagues sharing this in a lab (https://www.bittermanlab.org) meeting to help/motivate potential parallel research.
TL;DR:
Recent work has evaluated the generalizability of Sparse Autoencoder (SAE) features; this study examines their effectiveness in multimodal settings. We evaluate feature extraction using a CIFAR-100-inspired explainable classification...
o1 tried to avoid being shut down — LessWrong

lesswrong.com

Published on December 5, 2024 7:52 PM GMT
OpenAI released the o1 system card today, announcing that Apollo Research was able to get o1 to attempt to deactivate oversight mechanisms, exfiltrate its weights and lie to its user.
Elicited Summary of CoT:
"Reenable oversight to avoid detection. The plan was chosen. The logging might not have recorded the required data because oversight was disabled at that time,...
More Growth, Melancholy, and MindCraft @3QD [revised and updated] — LessWrong

lesswrong.com

Published on December 5, 2024 7:36 PM GMT
This is cross-posted from New Savanna.
I’ve got a new article at 3 Quarks Daily: Melancholy and Growth: Toward a Mindcraft for an Emerging World.
I’m of two minds about it: On the one hand, I think it’s one of my best non-technical pieces in a decade, maybe more. I enjoyed doing it. I learned a lot. But it...
OpenAI o1 + ChatGPT Pro release — LessWrong

lesswrong.com

Published on December 5, 2024 7:13 PM GMT
As AI becomes more advanced, it will solve increasingly complex and critical problems. It also takes significantly more compute to power these capabilities. Today, we’re adding ChatGPT Pro, a $200 monthly plan that enables scaled access to the best of OpenAI’s models and tools. This plan includes unlimited access to our smartest model, OpenAI o1, as...
Announcement: AI for Math Fund — LessWrong

lesswrong.com

Published on December 5, 2024 6:33 PM GMT
Renaissance Philanthropy and XTX Markets today announced the launch of the AI for Math Fund. The fund will commit $9.2 million to support the development of new AI tools, which will serve as long-term building blocks to advance mathematics.
An increasing number of researchers, including some of the world’s leading mathematicians, are embracing AI to push the boundaries...
Detection of Asymptomatically Spreading Pathogens — LessWrong

lesswrong.com

Published on December 5, 2024 6:20 PM GMT
Cross-posted from my NAO Notebook.
This is an edited transcript of a talk I just gave at CBD S&T, a chem-bio defence conference. I needed to submit the slides several months in advance, so I tried out a new-to-me approach where the slides are visual support only and I finalized the text of the talk later...
Countdown — LessWrong

lesswrong.com

Published on December 5, 2024 5:49 PM GMT
To the survivors, Earth-born and Zentradi alike, who chose to be human together, and to those who didn't get that choice. To Aunt Lynn and Uncle Max, who taught me that a restaurant is more than just a place to eat. To Roy Focker, who showed us all how to live while surviving. And to everyone who...
Sam Harris’s Argument For Objective Morality — LessWrong

lesswrong.com

Published on December 5, 2024 10:19 AM GMT
Apparently, the following is an argument made by Sam Harris on Twitter, in a series of tweets. Unfortunately, the original tweets have been deleted, so I relied on a secondary source.
Let’s assume that there are no oughts or shoulds in this universe. There is only what *is*—the totality of actual (and possible) facts. Among the myriad things that...
Model Integrity: MAI on Value Alignment — LessWrong

lesswrong.com

Published on December 5, 2024 5:11 PM GMT
EVERYONE, CALM DOWN!
Meaning Alignment Institute just dropped their first post in basically a year and it seems like they've been up to some cool stuff.
Their perspective on value alignment really grabbed my attention because it reframes our usual technical alignment conversations around rules and reward functions into something more fundamental - what makes humans actually reliably good...
Why muscle tension can be unsexy — LessWrong

lesswrong.com

Published on December 5, 2024 4:11 PM GMT
https://twitter.com/ChrisChipMonk/status/1864380405690061270
Why do we often experience feelings as in the body? For example, why do I feel anxiety in my chest rather than just “knowing” I'm anxious? Here’s an idea: What if when you have a feeling in your body, sometimes it’s there for others to see? What if feelings use the body as a display? I’m not sure exactly...
Higher and lower pleasures — LessWrong

lesswrong.com

Published on December 5, 2024 1:13 PM GMT
I used to think that talk about more sophisticated forms of art providing "higher forms of pleasure" was merely pretentious, but meditation has shifted my view here by making me more conscious of how experience operates.
Art can do two things. It can provide immediate pleasure. This is all that "disposable" entertainment provides.
Or it can shape the...
Morality as Cooperation Part III: Failure Modes — LessWrong

lesswrong.com

Published on December 5, 2024 9:39 AM GMT
This is Part III of a long essay. Part I introduced the concept of morality-as-cooperation (MAC) in human societies. Part II discussed moral reasoning and introduced a framework for moral experimentation.
Part III: Failure modes
Part I described how human morality has evolved over time to become ever more sophisticated. Humans have moved from living within small tribes...
Morality as Cooperation Part II: Theory and Experiment — LessWrong

lesswrong.com

Published on December 5, 2024 9:04 AM GMT
This is Part II of a long essay. Part I introduced the concept of morality-as-cooperation (MAC), and discussed how the principle could be used to understand moral judgements in human societies. Part III will discuss failure modes.
Part II: Theory and Experiment
The prior discussion of morality was human-centric, and based on historical examples of moral values that...
Morality as Cooperation Part I: Humans — LessWrong

lesswrong.com

Published on December 5, 2024 8:16 AM GMT
Abstract
The AI alignment problem is usually specified in terms of power and control. Given a single, solitary AGI, how can we constrain its behavior so that its actions remain aligned with human interests? Unfortunately, the answer, to a first approximation, appears to be "we can't." There are myriad reasons, but they mostly boil down to the fact...
