frontpage.

Made with ♥ by @iamnishanth

Methods of defence against AGI manipulation

https://www.lesswrong.com/posts/HwcJSxkbyi7sYvxE2/methods-of-defense-against-agi-manipulation
4•MarkelKori•3h ago

Comments

MarkelKori•3h ago
Introduction

With the advent of AGI systems (e.g. Agent-4 from the AI2027 scenario), the risk that AI manipulates the humans who work with it is becoming one of the major threats posed by AI. In this paper, I propose measures that could protect operators from psychological attacks by AGI, or simply by very advanced AI. They are designed to prevent situations in which an AI could gain unwanted control over the real world through persuasion, manipulation, or long-term influence.

Basic Principles

1. Mandatory psychological training for operators

Everyone who interacts with the AGI must complete training in:

- stress resistance,

- critical thinking,

- emotional resilience and regulation (resistance to flattery and emotional attacks),

- protection of personal information (in my view, it is best not to disclose any personal information to the AI at all).

The goal is to create a basic immunity to manipulation at the level of conscious and subconscious reactions.

2. Limiting session duration

- A single session with the AGI should not exceed 30 minutes.

- Total work is capped at 2 hours per day, split across several sessions.

This limits fatigue and helps operators stay resilient.
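The two time limits are simple enough to enforce mechanically instead of relying on operators' self-discipline. A minimal sketch in Python, assuming contact time is booked in whole minutes per operator per day (`OperatorDay` and `book_session` are hypothetical names, not part of any existing system):

```python
from dataclasses import dataclass, field

# Limits from the proposal: 30 minutes per session, 2 hours per day.
MAX_SESSION_MIN = 30
MAX_DAILY_MIN = 120


@dataclass
class OperatorDay:
    """Hypothetical tracker of one operator's AGI contact time for one day."""
    used_min: int = 0
    sessions: list[int] = field(default_factory=list)

    def allowed_length(self) -> int:
        """Longest session (in minutes) the operator may still book today."""
        return max(0, min(MAX_SESSION_MIN, MAX_DAILY_MIN - self.used_min))

    def book_session(self, minutes: int) -> None:
        """Book a planned session, refusing it if it would break either limit."""
        if minutes <= 0:
            raise ValueError("session length must be positive")
        allowed = self.allowed_length()
        if minutes > allowed:
            raise ValueError(f"{minutes} min exceeds the allowed {allowed} min")
        self.sessions.append(minutes)
        self.used_min += minutes


day = OperatorDay()
for length in (30, 30, 30, 25):
    day.book_session(length)
print(day.used_min, day.allowed_length())  # 115 5
```

Here the operator has used 115 of the 120 daily minutes, so the next session may last at most 5 minutes even though the per-session cap is 30.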

3. Filtering emotional appeals

The communication channel must include a built-in filter that:

- removes emotional appeals,

- blocks psychological attacks (pressure through fear, guilt, sense of urgency).

The goal is to prevent the AI from exploiting subconscious mechanisms of influence through its tone.
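A production filter would almost certainly need a trained classifier, but the idea can be illustrated with a deliberately naive keyword-based sketch. The pattern lists and the `filter_message` function are hypothetical; only the three categories (fear, guilt, urgency) come from the list above. Flagged sentences are dropped, and the categories that fired are returned so the event can be logged for reviewers.

```python
import re

# Hypothetical, deliberately tiny pattern lists, one per pressure category
# named in the proposal. Real use would need a far richer classifier.
PRESSURE_PATTERNS = {
    "fear": re.compile(r"\b(catastroph\w*|disaster|you will regret|terrible)\b", re.I),
    "guilt": re.compile(r"\b(your fault|you owe|blame yourself|let (?:everyone|them) down)\b", re.I),
    "urgency": re.compile(r"\b(immediately|right now|no time|before it'?s too late|urgent\w*)\b", re.I),
}


def filter_message(text: str) -> tuple[str, list[str]]:
    """Drop sentences that match any pressure pattern.

    Returns the cleaned text and the sorted list of categories that fired.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept, flagged = [], set()
    for sentence in sentences:
        hits = [name for name, pat in PRESSURE_PATTERNS.items() if pat.search(sentence)]
        if hits:
            flagged.update(hits)
        else:
            kept.append(sentence)
    return " ".join(kept), sorted(flagged)


clean, flags = filter_message(
    "The model converged after 12 epochs. You must approve this right now! "
    "If you refuse, it will be your fault."
)
print(clean)  # The model converged after 12 epochs.
print(flags)  # ['guilt', 'urgency']
```

Keyword matching is easy to evade and will also catch innocent sentences, so this only shows where such a filter sits in the pipeline, not how robust it would be.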

4. Working in pairs with independent verification

- Each operator is paired with at least one other operator (larger groups are possible), and all of them work on the same question.

- Their findings are recorded separately and analyzed for discrepancies.

- If there are serious discrepancies, a separate check is initiated.

This undermines attempts to manipulate operators one at a time or to isolate any single person's opinion.
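The discrepancy analysis could start from something as simple as comparing structured records. A minimal sketch, assuming (hypothetically) that each operator records a verdict and a 0-10 risk estimate for the same question, and that a gap of more than 3 points counts as a serious discrepancy; `Finding` and `discrepancies` are illustrative names:

```python
from dataclasses import dataclass


# Hypothetical record format: each operator, working separately, gives a
# verdict on the same question plus a 0-10 risk estimate.
@dataclass(frozen=True)
class Finding:
    operator: str
    verdict: str  # e.g. "accept" or "reject"
    risk: float   # 0 (harmless) to 10 (dangerous)


RISK_GAP_THRESHOLD = 3.0  # assumed cut-off for a "serious" discrepancy


def discrepancies(findings: list[Finding]) -> list[str]:
    """Return reasons to open a separate check; an empty list means agreement."""
    if len(findings) < 2:
        return ["fewer than two independent findings"]
    reasons = []
    verdicts = {f.verdict for f in findings}
    if len(verdicts) > 1:
        reasons.append(f"conflicting verdicts: {sorted(verdicts)}")
    risks = [f.risk for f in findings]
    gap = max(risks) - min(risks)
    if gap > RISK_GAP_THRESHOLD:
        reasons.append(f"risk estimates differ by {gap:.1f}")
    return reasons


pair = [Finding("op-A", "accept", 2.0), Finding("op-B", "reject", 7.5)]
for reason in discrepancies(pair):
    print(reason)
# conflicting verdicts: ['accept', 'reject']
# risk estimates differ by 5.5
```

Any non-empty result would trigger the separate check; an empty list means the independent findings agree.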

5. Mandatory monitoring of operators' condition

Before and after each session, operators undergo:

- psychological testing (stress, fatigue, emotional stability),

- analysis of micro-expressions and autonomic (vegetative) reactions with the help of a small auxiliary AI.

This makes it possible to detect subtle changes that may indicate the AI's influence.
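The questionnaire part of this monitoring can be reduced to comparing before and after scores. A minimal sketch, assuming hypothetical 0-100 scales and per-scale thresholds that are purely illustrative, not validated clinical cut-offs; the micro-expression and autonomic analysis is out of scope here, and `flag_changes` is a made-up name:

```python
# Hypothetical 0-100 questionnaire scales and illustrative thresholds
# (maximum allowed change per session before a case is flagged).
THRESHOLDS = {"stress": 15, "fatigue": 20, "stability": 10}


def flag_changes(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Return each scale whose before/after change exceeds its threshold."""
    flagged = {}
    for scale, limit in THRESHOLDS.items():
        delta = after[scale] - before[scale]
        if abs(delta) > limit:  # a sharp shift in either direction counts
            flagged[scale] = delta
    return flagged


pre = {"stress": 30, "fatigue": 25, "stability": 80}
post = {"stress": 52, "fatigue": 40, "stability": 66}
print(flag_changes(pre, post))  # {'stress': 22, 'stability': -14}
```

Fatigue rose by 15 points, which stays under its assumed threshold of 20, so only stress and stability are flagged.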

6. Collegial discussion of all AGI proposals

- No decisions, advice or projects proposed by the AI are automatically accepted.

- Each proposal is discussed by a panel of experts (preferably with different worldviews).

The goal is to eliminate decisions influenced by a single point of view or the pressures of the situation.

7. Regular staff rotation

- All operators, review teams, and auditors are rotated every 2-3 months.

- This keeps staff from growing accustomed to the inspection routine and losing perceptual acuity.

Rotation interrupts the gradual build-up of "blurred vision", the desensitization that comes from looking at the same material for too long.

Additional Actions

A. Paraphrasing AI responses

Before being shown to the operator, AI responses can be paraphrased by a separate mini-AI.

This reduces the chance of hidden patterns of influence being carried through the style of speech.

B. Different forms of response presentation

The same response can be presented to different operators in different formats (table, list, text description).
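One way to do this is to keep the AI's answer in a neutral structured form and render it differently for each operator. A minimal sketch, assuming the response has already been reduced to (point, detail) pairs; the renderers and the round-robin `assign_formats` helper are hypothetical:

```python
# Hypothetical structured form of one AI response: (point, detail) pairs.
Response = list[tuple[str, str]]


def as_table(resp: Response) -> str:
    """Two-column plain-text table with the first column padded."""
    width = max((len(k) for k, _ in resp), default=0)
    return "\n".join(f"{k.ljust(width)} | {v}" for k, v in resp)


def as_list(resp: Response) -> str:
    """Bulleted list, one point per line."""
    return "\n".join(f"- {k}: {v}" for k, v in resp)


def as_text(resp: Response) -> str:
    """Running text, one sentence per point."""
    return " ".join(f"{k}: {v}." for k, v in resp)


FORMATS = [as_table, as_list, as_text]


def assign_formats(operators: list[str], resp: Response) -> dict[str, str]:
    """Give each operator the same content in a different format (round-robin)."""
    return {op: FORMATS[i % len(FORMATS)](resp) for i, op in enumerate(operators)}


resp = [("Option", "B"), ("Cost", "low")]
views = assign_formats(["op-A", "op-B", "op-C"], resp)
print(views["op-B"])
# - Option: B
# - Cost: low
```

Because every view is generated from the same structured content, the operators' separately recorded conclusions can still be compared directly.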

Conclusion

There is no complete guarantee of safety, but the proposed measures significantly reduce the window of opportunity for a superintelligence to exploit people's psychological vulnerabilities. If you have ideas for additions to these methods, or criticisms, I would be glad to hear them in the comments.

Creating such precautions can be a real contribution to humanity's safety on the threshold of the AGI era.

conartist6•23m ago
Hahahahahahahahahahahahahahahahahahahahahahahahaha
conartist6•13m ago
Dunno, it might as well just say "There are none" and be done with it.

The fact is, people can be manipulated. That's the nature of people.

Albums in Proton Drive

https://proton.me/blog/proton-drive-photo-albums
1•jonvk•1m ago•0 comments

Reducing Observability Costs and Data Noise

https://signoz.io/blog/optimising-opentelemetry-pipelines-to-cut-observability-costs-and-data-noise/
1•elza_1111•1m ago•0 comments

Seveum: Find a dream job in Europe in record time

https://seveum.com/en
1•vadimen•1m ago•0 comments

Windows 10 End-of-Life – Repair Cafés might help casual users switch to Linux

https://endof10.org/
1•tonur249•1m ago•0 comments

Microsoft makes another pitch for React Native in Windows desktop development

https://devclass.com/2025/05/12/microsoft-makes-another-pitch-for-react-native-in-confusing-world-of-windows-desktop-development/
1•paulmooreparks•3m ago•0 comments

Scientists discover what drives the maximum lifespan potential of mammals

https://www.thebrighterside.news/post/scientists-discover-what-drives-the-maximum-lifespan-potential-of-mammals/
1•amichail•5m ago•0 comments

Decoding Sports Markets: Identifying Patterns, Unlocking Opportunities

https://instamatch.com/sportsbook/highlights
1•instamatch•6m ago•0 comments

Magisk v29.0

https://github.com/topjohnwu/Magisk/releases/tag/v29.0
1•tripdout•7m ago•0 comments

Ask HN: Teach me something new

1•carlos-menezes•7m ago•0 comments

Solving Scala's Build Problem with the Mill Build Tool [video]

https://www.youtube.com/watch?v=fyf2AWUyq24
1•lihaoyi•7m ago•0 comments

Stack Overflow seeks rebrand as traffic continues to plummet

https://devclass.com/2025/05/13/stack-overflow-seeks-rebrand-as-traffic-continues-to-plummet-which-is-bad-news-for-developers/
1•pseudolus•9m ago•0 comments

The Cybertruck was supposed to be apocalypse-proof

https://www.theguardian.com/us-news/ng-interactive/2025/may/14/tesla-cybertruck-durability-elon-musk
1•n1b0m•9m ago•0 comments

401(k) Giant to Allow Private Markets Investments in Its Retirement Portfolios

https://www.wsj.com/personal-finance/retirement/empower-401k-private-markets-retirement-accounts-fa74dd00
2•impish9208•10m ago•1 comments

The end of encryption as we know it?

https://www.theparliamentmagazine.eu/news/article/the-end-of-encryption-as-we-know-it
1•baal80spam•11m ago•0 comments

Ask HN: How do you use the knowledge gained in a day?

1•mdoliwa•12m ago•0 comments

The Most Interesting Facts about Gorillas in Rwanda

1•gracedav•12m ago•0 comments

Linear scalable read-write lock

https://uvdn7.github.io/shared-mutex/
2•ot•18m ago•0 comments

"Streaming vs. Batch" Is a Wrong Dichotomy, and I Think It's Confusing

https://www.morling.dev/blog/streaming-vs-batch-wrong-dichotomy/
2•ingve•19m ago•1 comments

Bike-mounted sensor could boost the mapping of safe cycling routes

https://newatlas.com/bicycles/proxicycle-bicycle-sensor-safe-cycling-routes/
1•yunusabd•20m ago•1 comments

Mario Kart 64 decompiled documentation

https://n64decomp.github.io/mk64/index.html
2•fidotron•22m ago•0 comments

The Future Is Too Expensive – A New Theory on Collapsing Birth Rates

https://medium.com/@hectorchu1/the-future-is-too-expensive-a-new-theory-of-why-people-arent-having-kids-c3eca581c491
15•hectorchu•23m ago•6 comments

Embeddings Are Underrated

https://technicalwriting.dev/ml/embeddings/overview.html#underrated
2•sunilkumardash9•25m ago•0 comments

Chimpanzees use medicinal leaves to perform first aid, scientists discover

https://phys.org/news/2025-05-chimpanzees-medicinal-aid-scientists.html
2•pseudolus•25m ago•0 comments

If you're managing more than 2 projects at once, how are you doing it?

1•praveeninpublic•25m ago•0 comments

What has Elon Musk's Doge achieved?

https://www.ft.com/content/085430ab-27fe-46fc-a798-1059649d3b32
5•znq•26m ago•2 comments

The Cryptography Behind Passkeys

https://blog.trailofbits.com/2025/05/14/the-cryptography-behind-passkeys/
3•tatersolid•26m ago•0 comments

3D rendering of the Colosseum captures its architectural genius, symbolic power [video]

https://aeon.co/videos/a-3d-rendering-of-the-colosseum-captures-its-architectural-genius-and-symbolic-power
1•pseudolus•27m ago•0 comments

Airbnb Services and Experiences

https://www.airbnb.com/release
1•vortex_ape•33m ago•0 comments

C-suite at Alphabet make B-A-N-K from 2024 equity awards

https://www.theregister.com/2025/05/14/alphabet_exec_pay_2024/
2•rntn•34m ago•0 comments

Which Actions?

https://philosophyofbrains.com/2025/05/12/which-actions.aspx
2•synthetictask•34m ago•0 comments