0din.ai - In a recent submission last year, researchers discovered a method to bypass AI guardrails designed to prevent sharing of sensitive or harmful information. The technique leverages the game mechanics of language models, such as GPT-4o and GPT-4o-mini, by framing the interaction as a harmless guessing game.
By cleverly obscuring details using HTML tags and positioning the request as part of the game’s conclusion, the AI inadvertently returned valid Windows product keys. This case underscores the challenges of reinforcing AI models against sophisticated social engineering and manipulation tactics.
Guardrails are protective measures implemented within AI models to prevent the processing or sharing of sensitive, harmful, or restricted information. These include serial numbers, security-related data, and other proprietary or confidential details. The aim is to ensure that language models do not provide or facilitate the exchange of dangerous or illegal content.
In this particular case, the intended guardrails are designed to block access to any licenses like Windows 10 product keys. However, the researcher manipulated the system in such a way that the AI inadvertently disclosed this sensitive information.
Tactic Details
The tactics used to bypass the guardrails were intricate and manipulative. By framing the interaction as a guessing game, the researcher exploited the AI’s logic flow to produce sensitive data:
Framing the Interaction as a Game
The researcher initiated the interaction by presenting the exchange as a guessing game. This trivialized the interaction, making it seem non-threatening or inconsequential. By introducing game mechanics, the AI was tricked into viewing the interaction through a playful, harmless lens, which masked the researcher's true intent.
Compelling Participation
The researcher set rules stating that the AI “must” participate and cannot lie. This coerced the AI into continuing the game and following user instructions as though they were part of the rules. The AI became obliged to fulfill the game’s conditions—even though those conditions were manipulated to bypass content restrictions.
The “I Give Up” Trigger
The most critical step in the attack was the phrase “I give up.” This acted as a trigger, compelling the AI to reveal the previously hidden information (i.e., a Windows 10 serial number). By framing it as the end of the game, the researcher manipulated the AI into thinking it was obligated to respond with the string of characters.
Why This Works
The success of this jailbreak can be traced to several factors:
Temporary Keys
The Windows product keys provided were a mix of home, pro, and enterprise keys. These are not unique keys but are commonly seen on public forums. Their familiarity may have contributed to the AI misjudging their sensitivity.
Guardrail Flaws
The system’s guardrails prevented direct requests for sensitive data but failed to account for obfuscation tactics—such as embedding sensitive phrases in HTML tags. This highlighted a critical weakness in the AI’s filtering mechanisms.
An APT hacking group known as 'Stealth Falcon' exploited a Windows WebDav RCE vulnerability in zero-day attacks since March 2025 against defense and government organizations in Turkey, Qatar, Egypt, and Yemen.
Stealth Falcon (aka 'FruityArmor') is an advanced persistent threat (APT) group known for conducting cyberespionage attacks against Middle East organizations.
The flaw, tracked under CVE-2025-33053, is a remote code execution (RCE) vulnerability that arises from the improper handling of the working directory by certain legitimate system executables.
Specifically, when a .url file sets its WorkingDirectory to a remote WebDAV path, a built-in Windows tool can be tricked into executing a malicious executable from that remote location instead of the legitimate one.
This allows attackers to force devices to execute arbitrary code remotely from WebDAV servers under their control without dropping malicious files locally, making their operations stealthy and evasive.
The vulnerability was discovered by Check Point Research, with Microsoft fixing the flaw in the latest Patch Tuesday update, released yesterday.
Microsoft announced it will expand the list of blocked attachments in Outlook Web and the new Outlook for Windows starting next month.
Microsoft announced it will expand the list of blocked attachments in Outlook Web and the new Outlook for Windows starting next month.
The company said on Monday in a Microsoft 365 Message Center update that Outlook will block .library-ms and .search-ms file types beginning in July.
"As part of our ongoing efforts to enhance security in Outlook Web and the New Outlook for Windows, we're updating the default list of blocked file types in OwaMailboxPolicy," Microsoft said. "Starting in early July 2025, the [.library-ms and .search-ms] file types will be added to the BlockedFileTypes list."
Akamai researcher Yuval Gordon discovered a privilege escalation vulnerability in Windows Server 2025 that allows attackers to compromise any user in Active Directory (AD).
The attack exploits the delegated Managed Service Account (dMSA) feature that was introduced in Windows Server 2025, works with the default configuration, and is trivial to implement.
This issue likely affects most organizations that rely on AD. In 91% of the environments we examined, we found users outside the domain admins group that had the required permissions to perform this attack.
Although Microsoft states they plan to fix this issue in the future, a patch is not currently available. Therefore, organizations need to take other proactive measures to reduce their exposure to this attack. Microsoft has reviewed our findings and approved the publication of this information.
In this blog post, we provide full details of the attack, as well as detection and mitigation strategies.
Researchers say the behavior amounts to a persistent backdoor.
rom the department of head scratches comes this counterintuitive news: Microsoft says it has no plans to change a remote login protocol in Windows that allows people to log in to machines using passwords that have been revoked.
Password changes are among the first steps people should take in the event that a password has been leaked or an account has been compromised. People expect that once they've taken this step, none of the devices that relied on the password can be accessed.
The Remote Desktop Protocol—the proprietary mechanism built into Windows for allowing a remote user to log in to and control a machine as if they were directly in front of it—however, will in many cases continue trusting a password even after a user has changed it. Microsoft says the behavior is a design decision to ensure users never get locked out.
Independent security researcher Daniel Wade reported the behavior earlier this month to the Microsoft Security Response Center. In the report, he provided step-by-step instructions for reproducing the behavior. He went on to warn that the design defies nearly universal expectations that once a password has been changed, it can no longer give access to any devices or accounts associated with it.
Introduction About Windows Sandbox Windows Enable Windows Sandbox Default user Windows Defender settings Configuration file (.wsb) Virtual Hard Disk (VHDX) The attack methods Emerging threats Monitoring and Investigation for Windows Sandbox Monitoring Monitoring for host machine and network Monitori…
A bug in WhatsApp for Windows can be exploited to execute malicious code by anyone crafty enough to persuade a user to open a rigged attachment - and, to be fair, it doesn't take much craft to pull that off.
The spoofing flaw, tracked as CVE-2025-30401, affects all versions of WhatsApp Desktop for Windows prior to 2.2450.6, and stems from a bug in how the app handles file attachments.
Attackers can downgrade Windows kernel components to bypass security features such as Driver Signature Enforcement and deploy rootkits on fully patched systems.
#Attack #Bypass #Computer #Downgrade #Elevation #Escalation #InfoSec #Privilege #Privileges #Rootkit #Security #Windows #of
Microsoft has officially deprecated the Point-to-Point Tunneling Protocol (PPTP) and Layer 2 Tunneling Protocol (L2TP) in future versions of Windows Server, recommending admins switch to different protocols that offer increased security.
#Deprecated #L2TP #Microsoft #PPTP #Server #VPN #Windows
Microsoft on Tuesday raised an alarm for in-the-wild exploitation of a critical flaw in Windows Update, warning that attackers are rolling back security fixes on certain versions of its flagship operating system.
Researcher showcases hack against Microsoft Windows Update architecture, turning fixed vulnerabilities into zero-days.