osmos::feed

Pydantic Evals The feature is described as "in beta" and comes with this very realistic warning: Unlike unit tests, evals are an emerging art/science; anyone who claims to know for sure exactly how your evals should be defined can safely be ignored. This code example from their documentation illustrates the relationship between the two key nouns - Cases and Datasets: from pydantic_evals import Case, Dataset case1 = Case( name="simple_case", inputs="What is the capital of France?", expected_output="Paris", metadata={"difficulty": "easy"}, ) dataset = Dataset(cases=[case1]) The library also supports custom evaluators, including LLM-as-a-judge: Case( name="vegetarian_recipe", inputs=CustomerOrder( dish_name="Spaghetti Bolognese", dietary_restriction="vegetarian" ), expected_output=None, metadata={"focus": "vegetarian"}, evaluators=( LLMJudge( rubric="Recipe should not contain meat or animal products", ), ), ) Cases and datasets can also be serialized to YAML. My first impressions are that this looks like a solid implementation of a sensible design. I'm looking forward to trying it out against a real project. Tags: evals, python, pydantic, generative-ai, ai, llms ( 1 min )

We’re planning to release a very capable open language model in the coming months, our first since GPT-2. [...] As models improve, there is more and more demand to run them everywhere. Through conversations with startups and developers, it became clear how important it was to be able to support a spectrum of needs, such as custom fine-tuning for specialized tasks, more tunable latency, running on-prem, or deployments requiring full data control. — Brad Lightcap, COO, OpenAI Tags: openai, llms, ai, generative-ai ( 1 min )

debug-gym They saw the best results overall from Claude 3.7 Sonnet against SWE-bench Lite, where it scored 37.2% in rewrite mode without a debugger, 48.4% with their debugger tool and 52.1% with debug(5) - a mechanism where the pdb tool is made available only after the 5th rewrite attempt. Their code is available on GitHub. I found this implementation of the pdb tool, and tracked down the main system and user prompt in agents/debug_agent.py: System prompt: Your goal is to debug a Python program to make sure it can pass a set of test functions. You have access to the pdb debugger tools, you can use them to investigate the code, set breakpoints, and print necessary values to identify the bugs. Once you have gained enough information, propose a rewriting patch to fix the bugs. Avoid rewriting the entire code, focus on the bugs only. User prompt (which they call an "action prompt"): Based on the instruction, the current code, the last execution output, and the history information, continue your debugging process using pdb commands or to propose a patch using rewrite command. Output a single command, nothing else. Do not repeat your previous commands unless they can provide more information. You must be concise and avoid overthinking. Via Import AI Tags: prompt-engineering, llms, python, generative-ai, llm-tool-use, ai, microsoft, claude ( 1 min )

Putting Gemini 2.5 Pro through its paces

2025-03-31

Simon Willison's Weblog Open

Ploum.net Open

Colossal Open

WebKit Open

Tubik Blog: Articles About Design Open

CSS-Tricks Open

Stratechery by Ben Thompson Open

Go Make Things Open

Scope of Work Open

Articles on Smashing Magazine — For Web Designers And Developers Open

Baldur Bjarnason - All Writing Open

Codrops Open

2025-03-30

ntietz.com blog - technically a blog Open

Simon Willison's Weblog Open

Jim Nielsen’s Blog Open

2025-03-29

Go Make Things Open

2025-03-28

WebKit Open

Colossal Open

Simon Willison's Weblog Open

CSS-Tricks Open

Go Make Things Open

Codrops Open

Articles on Smashing Magazine — For Web Designers And Developers Open

2025-03-27

Simon Willison's Weblog Open

tonsky.me Open

Ploum.net Open

JavaScript Weekly Open

WebKit Open

Colossal Open

Codrops Open

Go Make Things Open

julian.digital Open

Articles on Smashing Magazine — For Web Designers And Developers Open

Baldur Bjarnason - All Writing Open

Stratechery by Ben Thompson Open

2025-03-26

tania.dev | RSS Feed Open

Figma Blog | Shortcut Open

Simon Willison's Weblog Open

Jim Nielsen’s Blog Open

Colossal Open

Stratechery by Ben Thompson Open

Go Make Things Open

Baldur Bjarnason - All Writing Open

Herman's blog Open

Codrops Open

2025-03-25

Simon Willison's Weblog Open

Frontend Focus Open

Rust Blog Open

Figma Blog | Shortcut Open

Golang Weekly Open

The Go Blog Open

Mozilla Hacks – the Web developer blog Open

Colossal Open

Go Make Things Open

CSS-Tricks Open

Stratechery by Ben Thompson Open

Codrops Open

2025-03-24

Simon Willison's Weblog Open

Terminal Trove Newly Added Tools Open

Node Weekly Open

Figma Blog | Shortcut Open

Jim Nielsen’s Blog Open

Christian Heilmann Open

Stratechery by Ben Thompson Open

Go Make Things Open

CSS-Tricks Open

Codrops Open

2025-03-23

ntietz.com blog - technically a blog Open

Maggie Appleton Open

Simon Willison's Weblog Open

<antirez> Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open

Open