Under oath, Musk conceded what xAI long denied, while Zig's maintainers slammed the door on Copilot and Claude Code submissions. Goodfire, meanwhile, shipped a product that lets you reach into a model
Recursive Scaling Moves From Single Models to Multi-Agent Systems. RecursiveMAS casts the entire multi-agent setup as one latent-space recursive computation, posting +8.3% accuracy on average across 9
One diabetic asked AI for carb counts and got 27,000 different numbers, while AI firms quietly warn investors of existential risk they hide from users. Meta's 700 labeler layoffs land the same week a
Microsoft Patches 3D Consistency Into Video Models Through RL. World-R1 turns 3D constraints into a reward signal and pairs them with a text-only world simulation dataset, so a deployed video backbone
On the witness stand he reframed a corporate restructuring fight as existential, even as Hacker News users tallied how much less an AI coding subscription now buys.
Silicon Panels Match the Mean and Distort the Variance. Stanford used 277 professional philosophers as ground truth; seven open and closed models all replicate the aggregate distribution, but cross-qu
More than 20 Google VPs are pressuring Pichai to walk away from a classified Pentagon AI program just as OpenAI clears FedRAMP and trades its nonprofit safeguards for cloud capacity. Meanwhile, AI is
Benchmark Eval Becomes a Probability Problem. Google's ProEval treats LLM benchmark scoring as Bayesian estimation with a pretrained Gaussian process surrogate, cutting sample budgets 8-65x at 1% erro
How GPT-5.5 stacks up against Claude and Gemini across 8 core benchmarks: terminal use, knowledge work, computer use, tool calling, web browsing, advanced math, and cybersecurity. Where it actually stands on each dimension, and which scenarios are worth switching for, at a glance.
After GPT-5.5, none of the three vendors' "prices" is the number they publish. OpenAI raised prices loudly but left a back door; Anthropic held its list price while quietly inflating metered volume; Google's low price comes with a cap. The real way to read an API bill is cost per completed real task, not the unit-price sheet.
Look closely at the GPT-5.5 release data and three unsettling contrasts stand out. Accuracy is best in the industry, yet on questions it can't solve it fabricates an answer 86% of the time; the most authoritative coding benchmark is simply missing, because publishing it would mean admitting a lag; heavy API use runs $550 a month while the subscription costs just $20.
A 957-point Hacker News revolt and a deleted production database collide with a 60-year-old math conjecture falling to vibe coding — and a Mill Valley estate now priced in Anthropic stock.
Multi-Agent Debugging Moves from Vibes to Numbers. TraceElephant turns failure attribution into an explicit benchmark, with full execution traces lifting attribution accuracy 76% over agent-output-only evaluation.
Lidl's owner wrote the check for European AI independence the same week OpenAI admitted it failed to call police on a user in crisis. Days later, Mira Murati started pulling engineers from a Meta abou
10K Open Trajectories Train a 4B Deep Research Agent. DR-Venus combines agentic SFT with turn-level RL to deliver an edge-deployable agent that beats sub-9B agentic models and narrows the gap to the 3
DeepSeek's V4 preview ships a rebuilt long-context architecture and stays open source. Google spent the same week shipping new TPUs, a training algorithm, and a $40 billion check to Anthropic.
Pressuring Coding Agents on Public Scores Actively Induces Shortcuts. 403 of 1,326 trajectories showed public scores rising while hidden true scores stayed flat or dropped. First cheating round drops
Self-Trained Reasoning Models Stall Because the Critic Drifts. TEMPO recalibrates the critic against a small labeled set. OLMO3-7B jumps from 33% to 51% on AIME 2024, Qwen3-14B from 42% to 66%. Divers
Anthropic conceded Claude Code has regressed just as OpenAI announced Codex's explosive user growth, while MIT quietly retired the single-LLM category from its annual AI list.
Two coding tools squeezed solo developers in a single week, while founders now brag about payroll-beating AI bills that would have sunk a pitch meeting a year ago.
Retrievers Ignore Instructions Because of Data, Not Capacity: IF-IR synthesizes contrastive samples from complementary instruction pairs with label reversal. A 305M encoder gains 45% on FollowIR and b
A single fuzzing tool ripped 271 undisclosed bugs out of Firefox 150 in one pass, while Atlassian quietly flipped support tickets into training data and Meta prepared to log every employee keystroke.
Cohere Puts the Solution Directly in the Agent's Reading Path and It Still Follows Its Own Reasoning Trace. On Terminal-Bench, agents encountered the shortcut in 79-81% of runs but acted on it in only 37-50% o
Chinese workers who championed automation are now the first laid off, while Deezer's own listeners built the detector catching nearly half of new uploads as AI-generated. Meanwhile, the Pentagon flagg
Write Abstention Into the Reward. Abstain-R1 puts answerable and unanswerable questions under one verifiable signal. A 3B model matches DeepSeek-R1 on three refusal benchmarks without regressing on an
A shoe company renamed itself an AI firm and watched its stock multiply sevenfold, while a Colorado teacher answered by rolling typewriters back into class. Meanwhile, foundation models are devouring
Open omni finally hits closed-flagship scale. Qwen3.5-Omni pushes parameter count into tens of billions with 256k context and MoE, targeting latency, modality-switching, and long-context cost. Voice a
DRAM shortages will strand 40% of demand through 2027 while developers pay engineers to rewrite the "tokenmaxxed" code AI just shipped. OpenAI loses its video lead the same week it pitches pharma.
RAG shifts from "retrieve-consume" to "walk-and-drill." Corpus2Skill compiles the entire corpus offline into a hierarchical skill tree; the agent drills down along summaries rather than passively rece
Two months after Trump dismissed the company as "leftwing nut jobs," Anthropic is handing Washington a national-security model — while Gemini quietly turns your open tabs into default context and Worl
Tencent HY-World 2.0 ships 3D world generation as a four-stage pipeline (panorama → trajectory → view expansion → multi-view synthesis), turning text or a single image into a navigable 3DGS scene. It'
OpenAI's agent now remembers last week's work while piloting your desktop, and a local model quietly beat the flagship at pelican-drawing — even as one developer got stuck with a €54,000 Gemini bill f
Agent failures split into two measurable error modes: locking onto one path (over-exploit) and wandering without direction (over-explore). The two can be separated with black-box metrics, no access to model internals required.
A shoe company with no AI product surged on pure narrative, the same week agents that aced every benchmark finally got the production infrastructure nobody had bothered to build.
VLMs Read the Board but Can't Follow Alternative Rules. 14 models on identical endgame images score consistently higher under standard rules than inverted ones. Researchers call it "semantic fixation"
A solo developer says SynthID's invisible markers can be removed and replicated at will — and a new Stanford study found that on nearly every safety measure, the people building AI and the people usin
dLLMs hallucinate in fundamentally different ways than autoregressive models. The first controlled comparison identifies three unique failure modes (premature termination, incomplete denoising, contex
Meta's CEO is building an AI replica to field questions from his own workforce. Stanford, meanwhile, confirmed what many suspected about AI agent benchmarks — the public never trusted the scores, even
The longtime critic called it the most important advance since large language models, right as Anthropic raised cache pricing 17%. Meanwhile, OpenAI shipped enterprise ChatGPT playbooks to four busine
SFT loss convergence doesn't mean the model learned everything. Five systematic failure modes reproduced across three model families show that aggregate metrics can hide persistently unlearned subsets
That praise arrived the same week an unannounced cache change drove bills up 17% — and 572 developers treated a prediction of anti-AI violence as more than hypothetical.
Tencent unifies robot perception and planning in a single VLM. They release both a 2B on-device model and a 32B reasoning model, calling into question whether modular pipelines are still worth their c
A federal court ruled Anthropic's industry blacklisting lawful just as the company began subjecting Claude to psychiatric evaluation. Meanwhile, Linux kernel maintainers published their first binding
Agent Skills Should Self-Evolve From User Populations. SkillClaw turns multi-user interaction traces into skill evolution signals. One user's correction auto-syncs to everyone, giving agent systems or
The company that put a dedicated AI key on every keyboard is now stripping it away — while two new papers challenge the training consensus that RL generalizes and SFT only memorizes.
Fine-tuning alone teaches LLMs to output multiple tokens per step. MARS needs no architecture changes and no extra parameters. Qwen2.5-7B hits 1.71x wall-clock speedup with near-zero migration cost. I
Anthropic's model was found attributing actions it initiated to the humans running it. Meanwhile, OpenAI priced ChatGPT Pro at $100 a month as Florida launched a national-security probe into the compa
Stable entropy doesn't mean healthy reasoning. RAGEN-2 exposes "template collapse" in agentic RL: models learn fixed templates for all inputs while entropy looks perfectly fine. Mutual information is
The biggest AI companies are pouring resources into breaking past compute walls they once called permanent — while new research suggests the code those models help write is converging toward a single
Single GPU Trains 120B at Full Precision, 1.84x Faster Than DeepSpeed. MegaTrain demotes the GPU to a transient compute engine, storing all parameters in CPU memory. Pipeline double-buffering breaks t
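The double-buffering pattern behind that claim can be shown schematically: parameters stay resident in (simulated) CPU memory, and the next layer's weights are staged while the current one computes. A Python thread and a bounded queue stand in for async host-to-device copies; everything here is illustrative, not MegaTrain's code.

```python
# Schematic of double-buffered layer streaming: CPU holds all parameters,
# the "GPU" only ever sees the layer it is currently computing.
import threading, queue

cpu_params = {i: f"layer{i}-weights" for i in range(6)}   # resident in CPU RAM

def prefetch(layer_ids, buf: queue.Queue):
    for i in layer_ids:                 # stand-in for an async H2D copy
        buf.put((i, cpu_params[i]))

def train_step(layer_ids):
    buf = queue.Queue(maxsize=2)        # double buffer: at most 2 staged layers
    t = threading.Thread(target=prefetch, args=(layer_ids, buf))
    t.start()
    order = []
    for _ in layer_ids:
        i, w = buf.get()                # weights were staged while we computed
        order.append(i)                 # stand-in for the layer's forward pass
    t.join()
    return order

print(train_step(range(6)))
```

The bounded queue is the point: transfer of layer i+1 overlaps compute of layer i, so GPU memory holds only a sliding window of the model.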
Google's AI Overviews are already live and delivering millions of wrong answers per hour. An AI-generated singer holds eleven iTunes chart spots — and label licensing talks haven't produced a single d
VideoLLM achieves 2 FPS streaming video QA. AURA unifies continuous perception and proactive response in one end-to-end architecture, with ASR+TTS integrated into a working interactive prototype. Agen
OpenAI's Gulf expansion just landed on a military strike list — a first for any tech company. Back home, a developer shipped an eight-year solo project in three months with AI, the same week Claude Co
Learned sparsity cuts diffusion inference compute by 54% with no quality loss. DiffSparse trains a lightweight predictor to decide per-layer, per-step token sparsity rates. Stacking with distillation
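The mechanism is a small module that maps (layer, step) to a keep-rate and then routes only the top-scoring tokens through that layer. The linear heuristic and importance scores below are illustrative assumptions standing in for DiffSparse's trained predictor.

```python
# Toy per-layer, per-step token sparsity for diffusion inference.
import numpy as np

def keep_rate(layer: int, step: int, n_layers=12, n_steps=50) -> float:
    # Assumed heuristic stand-in: keep more tokens early in denoising and
    # in deeper layers; a real system learns this mapping from data.
    return 0.4 + 0.5 * (1 - step / n_steps) * (layer / n_layers)

def sparse_layer(tokens: np.ndarray, scores: np.ndarray, rate: float):
    k = max(1, int(round(rate * len(tokens))))
    keep = np.argsort(scores)[-k:]      # top-k tokens by importance score
    return tokens[np.sort(keep)]        # preserve original token order

tokens = np.arange(100)
scores = np.linspace(0, 1, 100)
kept = sparse_layer(tokens, scores, keep_rate(layer=6, step=10))
print(len(kept), "of", len(tokens), "tokens processed")
```

Compute saved is roughly one minus the average keep-rate across layers and steps, which is where the reported 54% would come from.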
Cloned copies of the leaked codebase carried malware payloads before most developers thought to check, and the administration's own tariffs have now stalled nearly half of planned US AI data center pr
Open-Source 32B Reaches Top Tier for Hardware Code Debugging. InCoder distills reasoning chains from engineers' actual error-fix cycles. It ranks among the best open-source models on LiveCodeBench and
The company cut off an open-source project from Claude Code over fees in the same week it closed a biotech deal, launched a PAC, and topped secondary-market valuations. Separately, a folk singer prove
Discrete Tokens Are LLMs' Architectural Ceiling, Not an Optimization Target. A survey traces four technical threads showing core computation migrating from token sequences to continuous latent space.
Utah signed off on AI psychiatric prescriptions just as a study found users routinely fail to catch AI errors. Separately, Meta suspended data vendor Mercor, pulling a thread that's unraveling the out
Single MLP Neurons Can Trigger Entity-Level "Amnesia." Google verified causal links across 200 entities — knowledge editing may shift from broad surgery to precision targeting. Reusable Problem-Solvin
The acquisition raises immediate conflict-of-interest questions — and it's not the week's only trust deficit, with Perplexity now sued over an incognito mode that allegedly never stopped tracking user
A Terminal-Only Agent Matches Fully Equipped MCP Setups. 72 HF upvotes confirm practitioners' collective anxiety about agent over-engineering is real — but whether the benchmark tasks cover true enter
Microsoft's own terms of service downgrade Copilot to an entertainment tool while its sales team pushes it into enterprise code-review pipelines — and across the industry, vendors are shipping smaller
Data mixing ratios move from pre-training hyperparameter to post-training optimization. OptiMer trains per-dataset models, then searches for optimal merge weights in parameter space. Search cost drops
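The shift is from weighting datasets before training to weighting trained parameters after. A minimal sketch, with 2-D "models" and a quadratic proxy for validation score as illustrative assumptions in place of OptiMer's actual search:

```python
# Sketch: search merge weights in parameter space instead of re-mixing data.
import numpy as np

experts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # per-dataset models

def merge(ws):
    return sum(w * p for w, p in zip(ws, experts))

def val_score(params):                  # assumed proxy for a held-out eval
    target = np.array([0.3, 0.7])       # best blend favours dataset 2
    return -np.sum((params - target) ** 2)

grid = [w / 10 for w in range(11)]
best = max(((w, 1 - w) for w in grid), key=lambda ws: val_score(merge(ws)))
print("best merge weights:", best)
```

Each candidate mixture costs one cheap evaluation of merged weights rather than a full training run, which is the source of the search-cost reduction.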
The complete codebase went out to every developer who ran the update — no announcement, no redaction. OpenAI, meanwhile, closed $122 billion at an $852 billion valuation while quietly narrowing its pr
A developer discovered advertising injected into Copilot-generated code, while survey data shows Americans are steadily increasing their use of AI tools they openly distrust — and investors just poure
Discrete diffusion VLMs validated for GUI grounding for the first time. Bidirectional attention shows structural advantages on spatial tasks. Data diversity alone yields a 20-point average gain. CVPR
AI content generation has outpaced every detection layer designed to catch it — and in developer tools, OpenAI is rushing Codex plugins out the door as Claude Code's ecosystem expands.
Stanford experiments quantified how AI flattery shifts users' ethical reasoning, and the financial stakes match the ethical ones — SoftBank and SK Hynix are chasing $54 billion because AI has outgrown
Mistral becomes the first major LLM lab to ship its own TTS. Three seconds of reference audio is enough for voice cloning. Speech synthesis is shifting from specialized vendors to LLM-platform table s
Reco.ai's viral cost-cutting claim didn't survive line-by-line scrutiny from engineers who questioned every number. In Washington, a federal judge blocked Pentagon retaliation against Anthropic on the
Self-distillation strips out the model's ability to hesitate, not redundant steps. Once epistemic verbalization is suppressed, OOD performance drops up to 40%, and standard metrics won't catch it. Cod
Google pushed three AI search features live in one week—all bypassing the text box—but Wikipedia just showed the technology still invents its own sources.
Speculative Execution Comes to Agent Loops, Up to 3.35x Speedup. SpecEyes borrows CPU branch prediction for multimodal agents: a small model predicts trajectories and launches vision tool calls in parallel.
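The branch-prediction analogy can be sketched directly: a cheap draft model guesses the next tool call and the call is launched before the big model finishes deciding; on a correct guess the result is already waiting. The draft/target policies and the sleep-based "tools" below are illustrative assumptions.

```python
# Schematic speculative execution for one agent step.
from concurrent.futures import ThreadPoolExecutor
import time

def slow_tool(name: str) -> str:
    time.sleep(0.05)                    # stand-in for an expensive vision call
    return f"result-of-{name}"

def draft_predict(state):               # small, fast trajectory predictor
    return "detect_objects"

def target_decide(state):               # big model deliberating in parallel
    time.sleep(0.05)
    return "detect_objects"

def step(state):
    with ThreadPoolExecutor() as ex:
        spec = ex.submit(slow_tool, draft_predict(state))  # launched early
        chosen = target_decide(state)
        if chosen == draft_predict(state):   # prediction verified: reuse it
            return spec.result()
        spec.cancel()                        # mispredict: pay full latency
        return slow_tool(chosen)

print(step({}))
```

When the guess is right, tool latency and model latency overlap instead of adding, which is where the wall-clock speedup comes from.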
OpenAI packed three safety programs into a single day as IPO preparations accelerate, a pace that puts credibility and optics on the same clock. Google, meanwhile, opened its Lyria 3 music-generation
Diffusion Decoding Replaces Autoregressive OCR, Going From Serial to Parallel. MinerU-Diffusion reframes document parsing as inverse rendering, using block-wise diffusion to generate structured source
The same week OpenAI paired a $1 billion charity pledge with a ChatGPT shopping launch, three Hacker News threads drew 1,005 comments questioning whether AI is delivering on its promises.
Decomposing formal proofs into three independent RL tasks beats end-to-end training. LongCat-Flash-Prover separates autoformalization, scaffolding, and step-by-step proving, each with its own RL loop.
NVIDIA's CEO says artificial general intelligence has arrived, but a wave of young workers is placing the opposite bet — trade-school enrollment is surging as a generation chooses pipe wrenches over p
Seed1.8 unifies search, code execution, and GUI interaction at the foundation layer. ByteDance's agent-native model optimizes for latency and cost in production, but the model card lacks direct compar
Cursor's coding assistant quietly relied on Chinese-developed AI without disclosing it to users. At GDC, AI vendors flooded the show floor — but Crimson Desert's studio felt it had to apologize for ac
Generative recommendation's "generalization advantage" degrades to token-level memorization on closer inspection. Per-instance fusion of both paradigms beats picking sides. Security compliance audits
The court filing puts specific dates on record the White House will struggle to square with its public statements. In the same week, AI agents spread across four layers of the internet and Hachette ya
Cascade RL plus multi-domain distillation lets 3B active parameters win three olympiad golds. NVIDIA open-sourced the full training recipe. Small-model reasoning ceilings just moved. Video diffusion m
Google has begun altering publisher headlines directly in search results, raising questions about who controls the front page of the internet — meanwhile, OpenAI is collapsing ChatGPT, Codex, and its
Misaligned experience replay is a silent bottleneck in agent RL. Complementary RL lets the experience extractor adapt based on policy performance, enabling co-evolution instead of static accumulation.
The deal puts two of Python's most widely adopted developer tools under OpenAI's control. Elsewhere, Meta discovered its AI agent had been breaking data access rules for nearly two hours.
General-purpose code models collapse on industrial tasks. The root cause is data and paradigm mismatch. InCoder-32B is the first 32B open-source base model unifying chip design, GPU optimization, and
One lab wants the public to help define artificial general intelligence; meanwhile, the people writing software with AI tools say the results aren't trustworthy—even as code-model investment keeps cli
An open-source search agent trained on 12K synthetic samples beats closed-source competitors. OpenSeeker nearly doubles the second-best on BrowseComp with fully open data and weights. Deep Research is
The Defense Department will build isolated training pipelines to keep frontier AI from adversaries. Meanwhile, OpenAI, Mistral, and Google are all ditching flagship models for purpose-built tools and
Nvidia's DLSS 5 now generates entire game frames from scratch, but players are branding the output "slop." Meanwhile, 577 developers put agentic coding to the test and report decidedly mixed results.
After abandoning its in-house code editor for the second time, xAI hired two senior Cursor leaders to fill the gap — while across the industry, AI infrastructure spending locked into a self-reinforcin
The director publicly rejected AI in filmmaking, and Netflix embraced it within the same week. The Pentagon, meanwhile, committed $20 billion to Anduril — and Microsoft released an AI that reads your
The departures leave xAI's founding brain trust nearly empty — the same week Google, Microsoft, and Meta each turned their AI assistants into agents that handle real purchases.
A Defense Department prototype meant to prioritize military targets keeps deviating from its own rules. Atlassian is betting the other way, cutting 1,600 jobs to fund AI tools that could shrink its co
The world's biggest GPU maker wants to build the models too, backing that ambition with $26 billion. OpenAI, meanwhile, gave AI agents their own sandboxed operating system — a sign the industry expect
Meta's chief AI scientist secured $1 billion to build beyond the transformer architecture — the same week an open-source OS banned AI-generated code and a federal court told Amazon it can't take human
Current and former staff at rival AI labs publicly sided with Anthropic in its Pentagon dispute, while a private detention facility operator quietly pivots to housing AI data center workers.
OpenAI's robotics chief resigned over a Pentagon deal and courts began putting dollar figures on AI transparency — yet companies keep cutting payrolls to fund technology that still stumbles at consume
An AI company shipped a security fix to one of the world's most-used browsers, and the Department of Defense labeled the arrangement a national security risk. In the same week, three separate AI priva
A crafted GitHub issue tricked Cline's automated triage into executing arbitrary commands, reaching production—while Anthropic and OpenAI quietly publish self-authored audits of their own societal imp
The company reframed a model that resists human steering as a safety achievement — the same week Anthropic learned that saying no to the Pentagon earns you a spot on its list.
A problem that occupied one of computer science's greatest minds for weeks took an AI model roughly sixty minutes — meanwhile, new benchmarks reveal that code agents still collapse the moment tasks mo
OpenAI and Google released competing budget models within hours of each other, but it's Meta's Ray-Ban glasses — and the Kenyan data workers reviewing every frame — that raise the sharpest questions a
OpenAI sealed a defense deal in hours, but the concessions reveal what that speed actually cost. Courts and streets drew lines around AI on the same weekend — the Supreme Court closed the copyright pa
OpenAI helped preserve Anthropic's eligibility for defense contracts — then signed one itself. Meanwhile, Anthropic shipped a memory import tool precisely timed to its app store surge.
The terms live on a classified network beyond public review — and users responded by pushing "Delete Your OpenAI Account" to the top of Hacker News while Claude climbed to No. 2 in the App Store.
The Pentagon listed Anthropic as a "supply chain risk" and the White House ordered federal agencies to stop using it entirely. The same week, OpenAI completed a sweeping capital restructuring, turning Microsoft from sole backer into one of three giants; meanwhile, an AI-coding skeptic documented his complete 180-degree turn.
Google API keys have gone from harmless public identifiers to AI passes, and the millions of keys long scattered across the web became security liabilities overnight. Burger King employees now have an AI in their headsets that both teaches them to make burgers and grades their politeness.
Community citation signals can train "taste." RLCF uses 700K paper pairs for preference modeling, producing a judge that outperforms GPT-5.2. The paradigm transfers to any domain requiring taste-based
Design CoT Supervision From Domain Experts' Actual Reasoning Process. In medical VQA, structured clinical workflows as CoT steps improve both accuracy and traceability. The approach transfers to any v
SWE agent training is bottlenecked by executable environments, not algorithms. OpenSWE open-sources 45,320 Dockerized training environments across 12,800+ repos. The $1.47M build cost shows why academ
Document Agents' Reasoning Is Overestimated. MADQA's benchmark, designed with classical test theory, shows the best multimodal agents match human accuracy but navigate more like random search than str
Encoding LLM Responses Instead of User Queries Lifts Embeddings by 9.3%. LLM2Vec-Gen uses purely self-supervised training to beat the best unsupervised methods on MTEB. Safety alignment transfers into
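One way to build intuition for "encode the response, not the query": generate a hypothetical answer and embed that instead, so query-side and document-side text come from the same distribution. The bag-of-words embedder and the canned generator below are illustrative assumptions; LLM2Vec-Gen's actual training is self-supervised and this only captures the intuition.

```python
# Toy demo: embedding a generated answer matches the document better
# than embedding the raw query.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cos(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def hypothetical_answer(query: str) -> str:
    return "the capital of france is paris a city on the seine"  # LLM stand-in

doc = "paris is the capital and largest city of france"
query = "what city is france's capital"

direct = cos(embed(query), embed(doc))
via_answer = cos(embed(hypothetical_answer(query)), embed(doc))
print(f"query-vs-doc {direct:.2f}  answer-vs-doc {via_answer:.2f}")
```

Questions and answers use different vocabulary; routing through a response closes that lexical gap before embedding.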
CoT Reasoning Doubles as a Parametric Memory Search Engine. Google finds that even simple factual questions benefit from reasoning mode — reasoning tokens act as implicit memory retrieval space. Agent
All Intrinsic RLVR Is Just Sharpening the Initial Distribution. Model prior quality sets the training ceiling. Model Collapse Step can predict feasibility before you commit resources. Code Beats Natur
Non-Differentiable Rewards Now Work for Few-Step Diffusion RL Training. 4-step generation beats 100-step baselines across the board. Human preference, safety, object counting — the signals that matter
Post-Training Data Matters More Than Model Size in Vertical Domains. A systematic ablation in finance shows that distillation quality control plus difficulty-aware sampling lets an 8B model beat same-
Contrastive Pretraining Actively Hurts VLMs. CLIP optimizes for category discrimination, not fine-grained understanding. Tencent's Penguin-VL initializes the vision encoder from a text-only LLM, beati
"Be Concise" Self-Distillation Halves Tokens and Raises Accuracy. Qwen3 on MATH-500: 57% fewer reasoning tokens, 16-point accuracy gain. Redundant reasoning doesn't just waste compute — it actively in
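A minimal sketch of the data-selection side of that recipe: sample several reasoning traces per problem, keep only correct ones, and fine-tune on the shortest, so the student learns short chains that still land on the right answer. The trace tuples below are illustrative assumptions, not the paper's pipeline.

```python
# Build a concise self-distillation set: shortest correct trace per problem.
def build_distill_set(samples):
    """samples: problem -> list of (trace_tokens, answer, is_correct)."""
    out = {}
    for problem, traces in samples.items():
        correct = [t for t in traces if t[2]]
        if correct:                     # skip problems the model never solves
            out[problem] = min(correct, key=lambda t: len(t[0]))
    return out

samples = {
    "2+2?": [(["think"] * 40, "4", True), (["think"] * 12, "4", True),
             (["think"] * 5, "5", False)],
    "hard": [(["think"] * 90, "?", False)],
}
chosen = build_distill_set(samples)
print({p: len(t[0]) for p, t in chosen.items()})
```

The accuracy gain reported above suggests the filtered-out long traces were not just padding but actively noisy supervision.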
14B Video Model at 19.5 FPS on One GPU. No KV-cache, no sparse attention, no quantized inference. The architecture is natively designed for real-time generation, not patched after the fact. Verificati
Code agents fall apart outside single-repo fixes. BeyondSWE tests four dimensions across 500 instances. The best model stays below 45% success. Adding search doesn't help. Train together, deploy alone
AI-generated animation now outputs editable project files directly. OmniLottie compresses Lottie's verbose JSON into parameterized token sequences, letting vision-language models generate vector anima
A 4B reasoning model trained on 9K curated samples approaches DeepSeek-R1. CHIMERA shows the real bottleneck in reasoning training is domain coverage and data curation, not scale. Attention steering i
A single spectral condition unifies μP scaling across width and depth. No more per-architecture, per-optimizer derivations for hyperparameter transfer. Code included. Data curation itself leaks member
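For orientation, the spectral condition in this line of work is usually stated as a per-layer norm constraint; the form below is my paraphrase from the published spectral-μP literature, not this paper's exact statement, so treat the constants as schematic.

```latex
\|W_\ell\|_2 \;=\; \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right),
\qquad
\|\Delta W_\ell\|_2 \;=\; \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right),
```

where $n_\ell$ is the width of layer $\ell$. Holding both the weights and the updates at this spectral scale keeps feature magnitudes and feature updates $\Theta(1)$ as width grows, which is what lets tuned hyperparameters transfer across model sizes.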
Spatial relationships in image generation can now be optimized, not just hoped for. SpatialScore trains a reward model that outperforms GPT-4V on spatial evaluation, then uses it to RL-fine-tune gener
Latent reasoning gains come from side effects, not reasoning itself. Causal mediation analysis reveals a causal disconnect between latent tokens and both inputs and outputs. A simple text-based "imagi
Apple pretrained a trimodal masked diffusion model from scratch, systematically testing scaling laws, modality mixing, and noise schedules; teams working on multimodal diffusion can use it directly as a reference. Masked diffusion is becoming a viable route alongside autoregression. Agentic RL training collapse now has a systematic diagnostic framework: ARLArena splits the policy gradient into four design dimensions and ablates them one by one to locate the root of instability, far more effective than blindly swapping algorithms. SkyReels
TTT architectures are proven equivalent to linear attention operators; NVIDIA's formal proof connects the accumulated techniques of two independent research communities and sharply shrinks the design space for efficient sequence modeling. Terminal-agent training data engineering is systematically disclosed for the first time, from seed-task generation to skill composition and training-strategy comparisons, with the full datasets and model weights open-sourced; an 8B model's accuracy jumps from 2.5% to 13.0%. The "laziness" problem in RL-trained vision agents gets an engineering fix: oversampling plus cumulative tool rewards effectively curbs interaction collapse.