What principle guides every test we run?
We treat every claim an app makes as a hypothesis to be measured, not a feature to be listed. A tracker that advertises "AI-accurate photo logging" has to prove it against weighed food, on the same plates we give every other app. Our job is to make the testing boring, repeatable, and reproducible by a second analyst — so the conclusions hold up even when they are inconvenient.
What is in the 1,400-dish, 24-country benchmark dataset?
The dataset is the backbone of our data-accuracy, international-food, and barcode scores. It contains 1,400 meals and dishes sourced from 24 countries, each with a reference weight and a verified nutrition breakdown:
- Single foods — weighed portions of whole and prepared foods.
- Packaged products — items with scannable barcodes and label data.
- Restaurant and takeaway dishes — real plates from real menus.
- Mixed and international dishes — the layered, sauced, and regional foods, across 24 countries, that expose the gap between a demo and a daily driver.
We deliberately over-weight messy, real-world, and non-Western food. Most apps look excellent on a plain chicken breast and fall apart on a bowl of mixed curry or a regional rice dish. The dataset is refreshed each cycle so it keeps pace with new products and menu items.
How do we measure accuracy across 134,000 photos and descriptions?
We ran 134,000 photos and dish descriptions of those 1,400 items through the apps, varying angle, lighting, framing, and the free-text phrasing a real person would use, then compared each estimate against the weighed reference value. We report accuracy as error bands, not a single hero number, because a tracker that is usually close but occasionally wildly off is a different tool from one that is consistently reliable. Each item is tested by one analyst and spot-checked by a second; disagreements are re-run.
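To make "error bands" concrete, here is a minimal Python sketch of how estimates could be sorted into bands. The band edges (5%, 10%, 20%), the function names, and the sample numbers are all assumptions for illustration; this page does not publish the actual thresholds or tooling.

```python
from collections import Counter

# ASSUMPTION: these band edges are illustrative only; the real thresholds
# are not stated in the methodology.
BAND_EDGES = [(5.0, "within 5%"), (10.0, "within 10%"), (20.0, "within 20%")]
OVERFLOW_BAND = "over 20%"


def percent_error(estimate, reference):
    """Absolute error of an estimate as a percentage of the weighed reference."""
    if reference <= 0:
        raise ValueError("reference value must be positive")
    return abs(estimate - reference) / reference * 100.0


def band_for(error_pct):
    """Return the label of the narrowest band that contains this error."""
    for upper, label in BAND_EDGES:
        if error_pct <= upper:
            return label
    return OVERFLOW_BAND


def error_band_shares(pairs):
    """Share of (estimate, reference) pairs that fall into each band."""
    counts = Counter(band_for(percent_error(est, ref)) for est, ref in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no estimates to summarise")
    labels = [label for _, label in BAND_EDGES] + [OVERFLOW_BAND]
    return {label: counts[label] / total for label in labels}


# Made-up calorie estimates for a single 500 kcal reference plate.
sample = [(510, 500), (460, 500), (540, 500), (900, 500)]
shares = error_band_shares(sample)
print(shares)
```

In this made-up sample, three of the four estimates land within 10% of the reference and one is 80% off. A single average error of roughly 24.5% would hide that split, which is the reason for reporting bands.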
What are the 10 scoring criteria, and how are they weighted?
Each app earns a 0–10 score on each of ten criteria. The overall score is a weighted composite of those ten scores: data accuracy, international food and barcode data, and AI nutritional guidance carry the most weight because they affect the most people, the most often. The weights sum to 100%.
| Criterion | Weight | What it measures |
|---|---|---|
| Data accuracy | 16% | How closely the app’s calorie and portion estimates match weighed reference values across our 1,400-dish dataset. |
| International food and barcode data | 13% | Coverage and accuracy of foods from all 24 countries in the dataset, plus barcode data for packaged products. |
| Speed | 10% | Time and friction to capture a meal by photo, voice, barcode, or text and reach a correct, logged entry. |
| App user experience design | 11% | Onboarding, clarity, and the day-to-day quality of the interface across tech-comfort levels. |
| AI nutritional guidance | 12% | Quality, safety, and evidence-basis of the AI nutrition guidance, including telling you what to eat next. |
| Meal and workout planning | 9% | Strength of meal-planning and workout-planning tools against your goals and remaining daily targets. |
| Healthy alternative provisions | 8% | How usefully the app suggests healthier swaps and alternatives for the foods you log or plan. |
| Allergy and restrictions customization | 8% | Depth of allergy, intolerance, medical, and dietary-restriction customization the app supports. |
| Chart visualization | 5% | Clarity of charts and trends, and whether the visuals actually help someone make a decision. |
| AI native implementation | 8% | How deeply AI is built into the core logging and coaching experience, rather than bolted on. |
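The weighting in the table can be sketched as a weighted average. The weights below are copied from the table; the dictionary keys, the function name, and the choice of a plain weighted average (with no rounding or other adjustment) are assumptions for illustration, not a description of the actual scoring tool.

```python
# Weights in percent, copied from the criteria table above.
# ASSUMPTION: the key names are hypothetical identifiers for the ten criteria.
WEIGHTS = {
    "data_accuracy": 16,
    "international_food_and_barcode": 13,
    "speed": 10,
    "ux_design": 11,
    "ai_nutritional_guidance": 12,
    "meal_and_workout_planning": 9,
    "healthy_alternatives": 8,
    "allergy_and_restrictions": 8,
    "chart_visualization": 5,
    "ai_native_implementation": 8,
}

# The ten weights must account for the whole score.
assert sum(WEIGHTS.values()) == 100


def overall_score(scores):
    """Weighted average of per-criterion scores, each on a 0-10 scale."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} must be between 0 and 10")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Made-up example: an app scoring 8 everywhere except a 10 on data accuracy.
example = {name: 8 for name in WEIGHTS}
example["data_accuracy"] = 10
print(overall_score(example))
```

Under these assumptions, raising data accuracy from 8 to 10 moves the overall score by 16% of two points, or 0.32, while the same two-point gain on chart visualization would move it by only 0.10.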
Who tests each part of an app?
No single person is qualified to judge everything from micronutrient accuracy to coaching tone, so each criterion is owned by the reviewer best suited to it. Our registered dietitian owns nutritional guidance and medical-diet handling; our data engineer owns the accuracy, database, and barcode tests; our coaching editor runs multi-week adherence and wearable trials; our UX editor times logging tasks and audits support. The editor-in-chief reviews every scored category before publication.
How long do we test before we publish?
Every app goes through the dataset benchmark plus a minimum of three weeks of daily, real-life use across a tester panel of different ages and tech-comfort levels. We log common meals by photo, barcode, voice, and text, and time each path. We submit real support tickets and record whether — and how — each app responds. Adaptive features, like targets that adjust to activity, are tested with synced wearables over the full trial.
How do we handle ties and close calls?
When two apps land within a tenth of a point of each other, we do not invent precision we do not have. The editor-in-chief decides whether there is enough evidence to separate them, and if not, we say so in the review. Scores can move between cycles as apps ship updates; we date and log every change.
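The close-call rule can be written as a simple check that flags a pair of scores for editorial review. This is a sketch under two assumptions: that "within a tenth of a point" includes a gap of exactly 0.1, and that overall scores are published to no more than two decimal places. The function name is hypothetical.

```python
def is_close_call(score_a, score_b, margin=0.1):
    """Flag two overall scores that are too close to rank without review."""
    # Round the gap before comparing so binary floating-point noise
    # (8.4 - 8.3 evaluates to slightly less than 0.1) cannot flip the result.
    return round(abs(score_a - score_b), 2) <= margin


print(is_close_call(8.4, 8.3))  # gap of 0.1: flagged for review
print(is_close_call(8.5, 8.3))  # gap of 0.2: ranked normally
```

The check only flags the pair; deciding whether the evidence separates the two apps remains an editorial judgment.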
How do we keep this honest over time?
Methodology is a living document. When we change a weight, add a criterion, or refresh the dataset, we note it here with a date. If we get something wrong, we correct it in public and say what changed. The goal is simple: a reader should be able to reconstruct how any number on this site was earned.