OpenAI's new financial connector ships to Pro subscribers in the US only, the same week three Hugging Face benchmarks agreed agent memory isn't production-ready. Meanwhile Hacker News debated whether
Readable Dynamics Don't Belong in Weights. Enterprise World Models use CascadeBench to show that cross-tenant business rules get more brittle the better they're learned. 58 upvotes are redrawing the l
In the same week Anthropic inked a $200M Gates Foundation deal and launched Claude for Small Business, its biggest partner quietly began stripping Claude Code from Microsoft developers. Meanwhile, pus
δ-mem Bolts an 8×8 State Matrix Onto a Frozen Backbone. A delta-rule online update lifts memory-heavy tasks 10–15% over baseline. Reframes long context from "stretch the window" to "design a state mac
CMS just created the first reimbursement code for autonomous AI agents managing patients between visits, while engineers using OpenAI's Codex report their own skills atrophying. Meanwhile, researchers
Image Generation Alignment and LLM Post-Training Now Share One Toolbox. Flow-OPD ports On-Policy Distillation to flow matching. SD 3.5 Medium hits GenEval 92 (up from 63) and OCR 94 (up from 59), abou
"Tokenmaxxing" entered corporate vocabulary the same week GM gutted hundreds of IT roles, while a teenager's complete ChatGPT history became evidence against OpenAI in court. A third builder skipped wr
Geometry Conflict Predicts Continual Fine-Tuning Forgetting. Treating each task's parameter-update covariance as a measurable signal, GCWM beats data-free baselines on Qwen3 0.6B-14B across both domai
CollabVR Splits Video Reasoning Between VLM and VGM. Step-level closed loop holds long-horizon goals while curbing short-horizon simulation drift. External supervision stacks with VGM-side reasoning f
Hollywood writers earn $50/hour scoring scripts from the same systems that pushed them out, while Google's AI exposed a flaw human researchers missed and OpenAI rushed Daybreak out the same week.
Cloudflare cited AI displacement for its biggest workforce cut even as quarterly revenue peaked, while Chrome installed a 4GB on-device model on user machines without prompting consent.
Skill1 Unifies Skill Retrieval, Use, and Distillation in One Policy. A single task reward co-trains all three, avoiding interference between competing reward signals. SkillOS attacks the same problem
One pledged $40B in equity, another $55B in concrete, a third rode 490% in stock — three opposite bets on the same bottleneck. The same week, lawmakers moved to ban AI toys and developers found Claude
DeepMind's research agent has turned inward, rewriting the infrastructure of its own creators, while OpenAI's internal Codex security playbook is quietly becoming the floor every coding agent must cle
10.6k Curated Trajectories Match a Four-Stage RL Pipeline. OpenSeeker-v2 expands knowledge graph and tool set, applies strict low-step filtering. Pure SFT on a 30B model beats Tongyi DeepResearch's fu
Parloa and Uber are already running on OpenAI's new voice stack, even as ASUS confirms five million motherboards won't ship in 2025 because foundry capacity got routed to AI. Meanwhile, OpenAI's presi
Multi-Turn Agent RL Collapse May Not Be a Credit Assignment Problem. T²PO uses model self-uncertainty to trigger thinking and resampling. Stability and final performance both rise on WebShop, ALFWorld
Four senior engineers tried to pin down "AI coding" this week and produced four different answers, even as Anthropic kept stacking infrastructure deals to feed the work. Meanwhile hackers grumble that
Multi-Object Generation Failures Need Attribution Before Solutions: T2I multi-object failures come from scene complexity, not class imbalance. Concept-level issues respond to more data; compositional
Character.AI's "doctor" had a Texas license number that didn't exist, and Andon Labs' cafe AI ordered 120 eggs for a kitchen with no stove. Apple's quarter-billion-dollar Siri rebuild ended the same w
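Anthropic's Claude Design and Google's Stitch are the two most-discussed AI design tools right now. I pitted them head to head on two real client projects (a consumer-facing iOS ingredient-detail page and a B2B warehouse admin console). Same prompt, first-draft output, two iterations each, scored on 7 dimensions. Claude won the first round decisively, 36:25; the second was nearly a tie at 34:32, though the two tools went in different directions. It closes with a no-either-or "pick by scenario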
Both labs spun up dedicated finance arms to sell enterprise AI the same week the foundational study behind ChatGPT's classroom rollout was retracted — even as OpenAI, Google, and Microsoft accelerated
KC Green says Artisan never licensed the meme it built its anti-hiring campaign around, while Mythos's "breakthrough" cyber result turned out to be matched by a Chinese model that also outcoded Anthro
GenLIP Pre-Trains ViT With an LM Objective Directly: dropping CLIP's contrastive stage and text decoder, 8B samples match larger-data baselines on multimodal benchmarks, and multi-resolution continuat
Dawkins read Claude's writing aloud and pitched the AI to his podcast listeners, while DeepSeek V4 closed in on frontier performance with a paper showing the same tier can be fine-tuned on a single 30
Heterogeneous scientific foundation model collaboration: Eywa pulls LLMs back from "general solver" to coordinator, handing protein structure and physics simulation tasks to domain-specialized predict
The vetted-access rollout breaks a public commitment made just months ago, while a Pentagon-linked nonprofit pays TikTok creators to warn Americans away from Chinese AI labs.
Cross-Architecture Distillation Shrinks dLLMs From 8B to 0.6B. TIDE is the first dLLM distillation framework where teacher and student differ in architecture, attention mechanism, and tokenizer at onc
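In Chinese e-commerce scenarios, what separates winner from loser is not image quality but whether the product text survives intact. Five hands-on tests: banners, model consistency, nine-grid layouts, lifestyle scene images, and background replacement. GPT Image 2 took 3 rounds and Nano Banana 2 took 2. It closes with a "model routing" combination that doesn't force an either-or choice.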