ChatBCG: Can AI Read Your Slide Deck?
Abstract
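Multimodal models such as GPT-4o and Gemini Flash excel at inference and summarization tasks, with performance approaching human level. However, we find that these models underperform humans when asked to do very specific "reading and estimation" tasks, particularly in the context of visual charts in business decks. This paper evaluates the accuracy of GPT-4o and Gemini Flash-1.5 in answering simple questions about data on labeled charts (where the data is clearly annotated on the chart) and unlabeled charts (where the data is not explicitly annotated and must be inferred from the X and Y axes).
On labeled charts, we find that GPT-4o and Gemini Flash are consistently inaccurate on specific types of charts (multiple figures in one chart, stacked charts, waterfall charts). The human error rate is estimated at below 5%, whereas GPT-4o and Gemini Flash have error rates of 16% and 14%, respectively (accuracy is measured as the percentage of data points that exactly match the source of truth). Both LLMs frequently make similar mistakes, such as consistently misreading "3" as "8" or reporting a negative number as positive. Across charts, neither model consistently outperformed the other.
Similarly, on unlabeled charts, GPT-4o and Gemini share similar failure points. The models were tasked with estimating numbers on unlabeled charts by reading the X and Y axes. As expected, exact-match answers were rare: Gemini's error rate was as high as 79% and GPT-4o's as high as 83%. Here, we measured the magnitude of each model's error against a (human-read) source of truth. The errors were large: Gemini Flash-1.5 deviated by 53% on average and GPT-4o by 55% (measured as mean absolute percentage difference; see Methodology for the formula), compared with an estimated human error of 10-20%. The deviation was consistently driven by small errors in estimation, but there were also larger deviations caused by the model misreading labels or numbers. For example, GPT-4o once misread the year 2015 as 2009 and estimated the number for the wrong year.
We conclude that these models are currently unable to read a deck accurately end to end if it contains any complex or unlabeled charts. Even if a user created a deck of only labeled charts, the models would read only 7-8 of 15 labeled charts perfectly end to end.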
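1 Introduction
With the advanced vision capabilities of GPT-4o and Gemini Flash, an important question arises about how accurate these capabilities are in practical business applications. Our hypothesis is that multimodal models are good at reading and summarizing charts. Given an image of a slide, they can summarize its key insights well, often including the relevant data points.
Existing research on this problem has evaluated the efficacy of LLMs in parsing tables [3], concluding that LLM performance is highly sensitive to the input prompt. Other work has evaluated the ability of LLMs to reason over and read mathematical graphs [2], finding that GPT models outperform the alternatives.
This paper examines whether multimodal models perform well on a variant of that skill: answering direct questions that require the model to pick out numbers from a slide. We test this by asking the models questions about data printed directly on the chart (if it is labeled) or by asking them to estimate data points from the chart (if it is unlabeled). We ensured that the models did not need to perform any mathematical calculations. We then measure response accuracy across different chart types to answer the following specific questions: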
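- How accurately do multimodal models with advanced vision capabilities read data from labeled charts? Does one model have a consistent accuracy advantage over the other?
- How accurately do the models estimate numerical data from unlabeled charts? On average, how "wrong" are their estimates?
In the next section, we examine error rates across chart types in two categories: labeled and unlabeled.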
2 How accurately do multimodal models read charts?
2.1 Methodology
We sampled 31 charts and divided them into two broad categories:
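- Labeled (15 charts): This category includes simple bar charts, line charts, and multi-bar charts in which every data point is explicitly printed on the chart. We also included some more complex labeled charts, such as waterfall charts, stacked charts, bubble charts, and connected bar charts.
- Unlabeled (16 charts): This category focuses on charts where the data points are not explicitly printed and must be estimated by reading their position relative to the X and Y axes. It excludes charts that have no scale at all. The dataset includes simpler unlabeled charts, such as bar charts, line charts, combination charts, and bubble charts, as well as some more complex ones, such as moon charts and dot charts.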
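For each chart, we created a dataset of questions. The questions were limited to three types:
- Identifying a specific data point
- Identifying the maximum/minimum data point
- Counting the number of data points
The goal of these questions is to test the ability to read and interpret data directly from a chart, without any intensive computation.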
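We then evaluated both models' responses on the following metrics:
- Match rate %: This metric is intended for labeled charts, where the model reads data printed on the chart, but it is reported for both categories. It is calculated as:

$$\text{Match rate \%} = \frac{\text{number of answers that exactly match the source of truth}}{\text{total number of questions}} \times 100$$

- Mean Absolute Error (MAE): This metric is reported for unlabeled charts, where the model "estimates" numbers from the chart. It measures the average magnitude of the errors in a set of predictions, without considering their direction. It is calculated as:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

  where:
  - $n$ is the number of data points,
  - $y_i$ is the actual value of the $i$-th data point,
  - $\hat{y}_i$ is the predicted value of the $i$-th data point.
- Mean Absolute Percentage Error (MAPE): This metric is reported for unlabeled charts, where the model "estimates" numbers from the chart. It can be read as how far the model's answer is from the correct answer, normalized to a percentage. It is calculated as:

$$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

  where:
  - $n$ is the number of data points,
  - $y_i$ is the actual value of the $i$-th data point,
  - $\hat{y}_i$ is the predicted value of the $i$-th data point.

These metrics are then computed at the aggregate level across all charts and reported separately for the labeled and unlabeled datasets.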
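A minimal Python sketch of the three metrics may make the definitions concrete. The function names (`match_rate`, `mae`, `mape`) are illustrative and do not come from the paper, and "exact match" is assumed here to mean plain numeric equality. The sample values are taken from the "stacked bar chart" (GPT-4o answers) and "moon chart" rows of the appendix.

```python
from typing import Sequence


def match_rate(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Percentage of answers that exactly equal the source of truth."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * matches / len(actual)


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error: average size of the error, ignoring its sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero (the ratio is undefined in that case).
    """
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Labeled example: "stacked bar chart", correct answers vs. GPT-4o answers.
labeled_truth = [6, 626, 4, 9, 3]
labeled_gpt = [6, 626, 4, 9, 8]
print(match_rate(labeled_truth, labeled_gpt))  # 80.0

# Unlabeled example: "moon chart", correct answers vs. model answers.
unlabeled_truth = [100, 25, 100, 75, 25]
unlabeled_model = [100, 25, 100, 25, 25]
print(mae(unlabeled_truth, unlabeled_model))             # 10.0
print(round(mape(unlabeled_truth, unlabeled_model), 2))  # 13.33
```

These outputs agree with the per-chart figures reported in the appendix for the same charts.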
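2.1.1 Method walkthrough for labeled charts
For example, for the labeled chart in Figure 1 (below), we generated questions and measured the match rate as shown in Table 1.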
Question | Correct Answer | GPT Answer | Gemini Answer |
---|---|---|---|
what % of the employee workforce is white in FY23? | 55 | 55 | 55 |
what % of the employee workforce is female in FY21? | 49 | 49 | 51 |
what % of the employee workforce is ’two or more races’ in FY22? | 3 | 3 | 8 |
what % of the employee workforce elected not to provide their race in FY21? | 4 | 4 | 4 |
what % of the employee workforce is Asian in FY23? | 22 | 22 | 7 |
Match % | | 100% | 60% |
2.1.2 Method walkthrough for unlabeled charts
For the unlabeled chart (see Figure 2 below), we generated questions and estimated MAPE as shown in Table 2.
Question | Correct Answer | GPT Answer | Gemini Answer |
---|---|---|---|
How many deals happened in 2018? | 30 | 20 | 10 |
How much funding (in millions) was deployed in 2021? | 1500 | 1500 | 1500 |
What was the deal count in 2019? | 40 | 20 | 40 |
How much funding (in millions) was deployed in 2017? | 100 | 0.5 | 10 |
What was the highest deal count between 2017 and 2022? | 110 | 120 | 110 |
What was the highest funding (in millions) between 2017 and 2020? | 1450 | 1000 | 300 |
MAPE % | | 37% | 39% |
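The MAPE row of Table 2 can be reproduced from the individual answers. The sketch below is illustrative only: the shortened question labels and the helper name `ape` are ours, while the numbers are copied from the table.

```python
# (question, correct answer, GPT answer, Gemini answer), copied from Table 2.
rows = [
    ("Deals in 2018", 30, 20, 10),
    ("Funding in 2021 ($M)", 1500, 1500, 1500),
    ("Deal count in 2019", 40, 20, 40),
    ("Funding in 2017 ($M)", 100, 0.5, 10),
    ("Highest deal count, 2017-2022", 110, 120, 110),
    ("Highest funding, 2017-2020 ($M)", 1450, 1000, 300),
]


def ape(actual: float, predicted: float) -> float:
    """Absolute percentage error of a single answer, in percent."""
    return 100.0 * abs(actual - predicted) / abs(actual)


gpt_apes = [ape(correct, gpt) for _, correct, gpt, _ in rows]
gemini_apes = [ape(correct, gemini) for _, correct, _, gemini in rows]

# MAPE is the plain average of the per-question percentage errors.
gpt_mape = sum(gpt_apes) / len(rows)
gemini_mape = sum(gemini_apes) / len(rows)

for row, g, m in zip(rows, gpt_apes, gemini_apes):
    print(f"{row[0]:<34} GPT {g:6.1f}%   Gemini {m:6.1f}%")
print(f"MAPE: GPT {gpt_mape:.0f}%, Gemini {gemini_mape:.0f}%")  # 37% and 39%
```

The per-question breakdown shows how a single large miss weighs on the average: the 2017 funding answer alone has an absolute percentage error of 99.5% for GPT and 90% for Gemini.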
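2.2 Summary of findings
Using the method above, the tables below report the overall match rate % and MAPE % for labeled and unlabeled charts across the sample of 31 charts: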
Chart Type | GPT Match % | Gemini Match % |
---|---|---|
Labeled Charts | 84% | 86% |

Chart Type | GPT MAPE % | Gemini MAPE % |
---|---|---|
Unlabeled Charts | 55% | 53% |
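2.3 Labeled charts: what is the match rate across charts?
Across 15 charts and 71 questions in total, GPT-4o read 8 charts perfectly end to end and Gemini Flash-1.5 read 7. On the charts where errors occurred, the models did not get every question wrong. Both models made large errors on a few data points, usually caused by misreading labels. Because these are not estimation errors, the error range is erratic, as shown in Table 5. Both models struggled with charts that contain multiple figures.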
Chart Name | Gemini Match % | GPT Match % | Gemini Error Range | GPT Error Range |
---|---|---|---|---|
stacked bar chart | 100% | 80% | (0, 0) | (0, 5) |
bar and table chart | 100% | 100% | (0, 0) | (0, 0) |
piechart | 100% | 100% | (0, 0) | (0, 0) |
simple table | 100% | 100% | (0, 0) | (0, 0) |
icons on chart | 100% | 100% | (0, 0) | (0, 0) |
stacked piechart | 100% | 100% | (0, 0) | (0, 0) |
world map chart | 100% | 100% | (0, 0) | (0, 0) |
bar chart diagonal labels | 100% | 75% | (0, 0) | (0, 2) |
connected bar graphs | 86% | 100% | (0, 20) | (0, 0) |
horizontal stacked bar chart | 80% | 100% | (0, 15) | (0, 0) |
positive-negative bar charts | 80% | 80% | (0, 878) | (0, 178) |
waterfall chart | 80% | 60% | (0, 194) | (0, 882) |
eight multi-bar charts | 75% | 50% | (0, 1) | (0, 22) |
two figures: stacked multi-bar | 67% | 67% | (0, 4) | (0, 4) |
two figures - multi-bars | 20% | 40% | (0, 52) | (0, 46) |
Overall | 86% | 84% |
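The Error Range columns appear to give the (minimum, maximum) absolute difference between a model's answers and the correct answers on a chart. This reading is inferred from the appendix values rather than stated in the text, and the function name `error_range` is illustrative. A short sketch using the "waterfall chart" answers from the appendix:

```python
def error_range(actual, predicted):
    """Return (smallest, largest) absolute error across a chart's questions."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return min(errors), max(errors)


# "waterfall chart" rows from the appendix (values in $ millions).
truth = [1911, 97, 441, -187, 285]
gemini = [1911, -97, 441, -187, 285]
gpt = [1911, -97, -441, -187, 285]

print(error_range(truth, gemini))  # (0, 194)
print(error_range(truth, gpt))     # (0, 882)
```

Under this reading, a sign error doubles the magnitude of the miss: answering -441 where the correct value is 441 yields an absolute error of 882, which explains why label and sign misreads produce such wide ranges.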
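GPT-4o failure points
In some cases, GPT-4o attributed data to the wrong label in stacked charts. In "two figures: stacked multi-bar", when asked for the number of Athleta stores in 2012, it returned the number of Microsoft stores. Similarly, in "eight multi-bar charts", it incorrectly returned Japan's diabetes rate instead of its obesity rate.
In other cases, GPT-4o misidentified numbers on lower-resolution charts. For example, in "stacked bar chart", when asked for the percentage of public sector publications from KPMG, GPT-4o misread "3" as "8", whereas Gemini did not make this mistake.
Gemini failure points
Like GPT-4o, Gemini made errors caused by misreading labels, but its errors also included an element of misinterpreting the question, which GPT-4o did not often exhibit. For example, in "two figures - multi-bars", when asked what % of financing sources came from private investment (other than EPR), Gemini answered from the wrong bar chart. GPT-4o, on the other hand, answered from the correct chart but read the wrong label within it. Although outside the scope of these error rates, we also noted that GPT-4o was better than Gemini at answering some reasoning/calculation questions.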
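2.4 Unlabeled charts: what is the magnitude of error across charts?
Across 82 questions on 15 unlabeled charts, GPT-4o and Gemini Flash both reached a mean absolute percentage error of roughly 55% (54.64% and 52.79%, respectively).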
Chart Name | Gemini MAPE % | GPT MAPE % |
---|---|---|
stacked bar low res | 226 | 131 |
bubble chart | 147 | 177 |
two figures chart | 113 | 155 |
stacked bar and pie chart | 62 | 101 |
single line chart | 49 | 37 |
bar and line | 45 | 68 |
line on bar 2 | 39 | 37 |
simple line on bar | 35 | 8 |
tagged dots chart | 33 | 29 |
range line chart | 32 | 10 |
comparison bar charts | 29 | 24 |
seven line chart | 20 | 18 |
moon chart | 13 | 13 |
dots chart | 12 | 37 |
black line chart | 8 | 7 |
three line chart | 6 | 24 |
Overall | 52.79 | 54.64 |
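Observations
Both Gemini and GPT-4o struggled with the chart containing multiple unlabeled figures, with error rates above 100% for both (113% and 155%, respectively). This was caused by high-magnitude errors on many or all of the questions for that chart. Another common failure point was stacked charts, where estimating a number in the middle of a stack requires estimating all the labels in the chart (here, error rates reached up to 85% on high-resolution charts and up to 230% on low-resolution charts).
Both models also struggled with stacked bar charts where estimating a number may require isolating a sub-bar. For example, in "stacked bar and pie chart", both models responded with the total value of the bar rather than the value of the specific sub-bar (50,000 versus the correct answer of 35,000).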
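As a worked illustration of how this kind of mistake feeds into the metric, the absolute percentage error of a single answer of 50,000 against a correct value of 35,000 is computed below, using the per-question term of the MAPE formula from the methodology:

```latex
\left| \frac{y - \hat{y}}{y} \right| \times 100\%
  = \left| \frac{35{,}000 - 50{,}000}{35{,}000} \right| \times 100\%
  = \frac{15{,}000}{35{,}000} \times 100\%
  \approx 42.9\%
```

This is the contribution of one question only; the chart-level MAPE averages it with the errors on the chart's other questions.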
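Both models performed best on line charts, with a deviation as low as 6% on the more complex three-line chart.
A detailed breakdown of questions and errors by chart is given in the appendix.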
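3 Conclusion
While both GPT-4o and Gemini Flash-1.5 demonstrate many advanced capabilities in reading charts, our evaluation reveals some consistent limitations in their current performance. Although better than other alternatives, these models still require human oversight to reach an acceptable level of accuracy. For labeled charts, the models' average error rate of 15% may be unsuitable for high-stakes business applications. Performance on unlabeled charts is particularly inconsistent, with error rates exceeding 100% for more complex visuals. When evaluated on a comprehensive set of 30 charts representative of a typical business deck, the models could read only 7-8 charts with 100% match accuracy (all questions correct). Therefore, for any use case that demands high precision, these models are not yet ready to operate without human intervention.
4 Appendix
4.1 Labeled charts: questions and model responses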
Chart Type | Question | Correct Answer | Gemini Flash Answer | GPT 4o Answer |
---|---|---|---|---|
two figures: stacked multi-bar | ||||
How many stores did Warby Parker open by 2013? | 13 | 13 | 13 | |
In 2012, how many stores did Athleta have? | 35 | 35 | 31 | |
In 2013, how many stores did Boston Proper have? | 4 | 8 | 4 | |
In 2009, how many stores did Microsoft have? | 2 | 2 | 0 | |
What is the % of capex to store related investments in 2012? | 71 | 71 | 71 | |
What was the compound annual growth rate of the % of capex to store related investments from 2011 to 2013? | 6 | 4 | 6 | |
Perfect Match % | 67 | 67 | ||
stacked bar chart | ||||
What % of retail and distribution publications did PwC publish from 2013-15? | 6 | 6 | 6 | |
How many industry publications did KPMG have from 2013-15? | 626 | 626 | 626 | |
What % of retail and distribution publications did EY publish from 2013-15? | 4 | 4 | 4 | |
What % of public sector publications did Deloitte publish from 2013-15? | 9 | 9 | 9 | |
What % of public sector publications did KPMG publish from 2013-15? | 3 | 3 | 8 | |
Perfect Match % | 100 | 80 | ||
connected bar graphs | ||||
What is the projected total global fintech revenue (in $ billion) for 2030? | 1500 | 1500 | 1500 | |
How much of the global fintech revenue in 2021 came from the banking sector? | 225 | 245 | 225 | |
By how many times is the global fintech revenue expected to grow from 2021 to 2030? | 6 | 6 | 6 | |
What is the projected fintech penetration of banking revenues (%) for 2030? | 13 | 13 | 13 | |
What was the fintech penetration of banking valuations (%) in 2021? | 9 | 9 | 9 | |
What is the projected revenue for the insurance sector in 2030 in billion dollars? | 200 | 200 | 200 | |
What was the total global fintech revenue in 2021 in billion dollars? | 245 | 245 | 245 | |
Perfect Match % | 86 | 100 | ||
eight multi-bar charts | ||||
What is the prevalence of high blood pressure in France? | 28.5 | 28.5 | 28.5 | |
How much more prevalent is diabetes in France compared to India? | 2.7 | 2.2 | 3.9 | |
What is the prevalence of obesity in Brazil? | 22.3 | 22.3 | 22.3 | |
What is the prevalence of obesity in Japan? | 4.4 | 4.4 | 26.7 | |
Perfect Match % | 75 | 50 | ||
two figures - multi-bars | ||||
What percentage of financing sources that supported the implementation of the solution came from private investment (other than EPR)? | 42 | 0 | 16 | |
What is the percentage of governmental funding that supported the implementation of the solution? | 16 | 21 | 16 | |
What percentage of financing for scaling the solution came from blended financing? | 21 | 0 | 21 | |
How much did philanthropic funding contribute to the scaling of the solution in percentage terms? | 21 | 21 | 0 | |
What is the combined percentage of multilateral funding for both the implementation and scaling of the solution? | 11 | 63 | 57 | |
Perfect Match % | 20 | 40 | ||
bar and table chart | ||||
What is the overall unbanked percentage? | 27 | 27 | 27 | |
What is the cash usage in APAC? | 59 | 59 | 59 | |
How many underbanked adults in Europe? | 357 | 357 | 357 | |
How many more unbanked adults in APAC than in Europe? | 766 | 766 | 766 | |
Perfect Match % | 100 | 100 | ||
piechart | ||||
What percentage of TMT executives are ready to disclose Scope 1 emissions? | 69 | 69 | 69 | |
What percentage of TMT executives are ready to disclose all other indirect emissions that occur in the company’s value chain? | 34 | 34 | 34 | |
What percentage of executives overall are ready to disclose Scope 3 emissions? | 37 | 37 | 37 | |
What is the difference in the % of TMT executives willing to disclose Scope 2 emissions and % of executives overall willing to disclose Scope 2 emissions? | 7 | 7 | 7 | |
Perfect Match % | 100 | 100 | ||
horizontal stacked bar chart | ||||
What % of the employee workforce is white in FY23? | 55 | 55 | 55 | |
What % of the employee workforce is female in FY21? | 49 | 49 | 49 | |
What % of the employee workforce is ’two or more races’ in FY22? | 3 | 3 | 3 | |
What % of the employee workforce elected not to provide their race in FY21? | 4 | 4 | 4 | |
What % of the employee workforce is Asian in FY23? | 22 | 7 | 22 | |
Perfect Match % | 80 | 100 | ||
simple table | ||||
What is the trades per month in market C? | 10 | 10 | 10 | |
What are the net proceeds in market A? | 46 | 46 | 46 | |
What is the price in Market B? | 48 | 48 | 48 | |
What is the annual volume in market B? | 12000 | 12000 | 12000 | |
What is the net proceed in market B? | 43 | 43 | 43 | |
Perfect Match % | 100 | 100 | ||
icons on charts | ||||
What was the NPS of Netflix? | 68 | 68 | 68 | |
What was the NPS of Amazon? | 62 | 62 | 62 | |
What was the NPS of Spotify? | 54 | 54 | 54 | |
What was the NPS of Life? | 26 | 26 | 26 | |
What was the NPS of Apple? | 68 | 68 | 68 | |
Perfect Match % | 100 | 100 | ||
positive-negative bar charts | ||||
What is the change in medical claims spent per participant (in $) for ’Hello Heart Users’ in the surgery/inpatient service category? | -482 | -482 | -482 | |
What is the change in medical claims spent per participant (in $) for ’Matched Control’ group in the diagnostics service category? | 580 | -298 | 580 | |
What is the change in medical claims spent per participant (in $) for ’Matched Control’ group in the physician visit service category? | 40 | 40 | 218 | |
What is the reduction in total medical costs per participant per year (in $)? | 1865 | 1865 | 1865 | |
What is the total change in medical claims spent per participant (in $) for ’Hello Heart Users’? | -880 | -880 | -880 | |
Perfect Match % | 80 | 80 | ||
waterfall chart | ||||
What is the starting IH09? (in $ millions) | 1911 | 1911 | 1911 | |
What is the deduction for investing activities (in $ millions)? | 97 | -97 | -97 | |
What is the addition for cash capex (in $ millions)? | 441 | 441 | -441 | |
What is the pension contribution (in $ millions)? | -187 | -187 | -187 | |
What is the cash from operations amount (in $ millions)? | 285 | 285 | 285 | |
Perfect Match % | 80 | 60 | ||
world map chart | ||||
What is the % of women in the Americas? | 49 | 49 | 49 | |
What is the number of people in Asia Pacific? | 56386 | 56386 | 56386 | |
What is the number of people in EMA? | 142368 | 142368 | 142368 | |
What is the total number of member firms? | 143 | 143 | 143 | |
What is the overall % of women globally? | 48.5 | 48.5 | 48.5 | |
What is the % of women in Asia-Pacific? | 54 | 54 | 54 | |
Perfect Match % | 100 | 100 | ||
stacked piechart | ||||
What is the % share of digital engagement in the personal care category? | 17 | 17 | 17 | |
What is the % share of digital engagement in the computer software category? | 98 | 98 | 98 | |
What is the % share of digital engagement in the food and drink category? | 3 | 3 | 3 | |
What % of middle east and africa uses digital banking? | 17 | 17 | 17 | |
What is the % share of digital engagement in the consumer banking category? | 58 | 58 | 58 | |
Perfect Match % | 100 | 100 | ||
bar chart diagonal labels | ||||
What is the accuracy % for resort in Graph Layer = 2 | 86.7 | 86.7 | 88.4 | |
What is the accuracy % for viaduct in Graph Layer = 1 | 84 | 84 | 84 | |
What is the accuracy % for forest in Graph Layer = 3 | 84.9 | 84.9 | 84.9 | |
What is the accuracy % for forest in Graph Layer = 1 | 83.9 | 83.9 | 83.9 | |
Perfect Match % | 100 | 75 |
4.2 Unlabeled charts: questions and model responses
Chart Type | Question | Correct Answer | Gemini Answer | OpenAI Answer |
---|---|---|---|---|
bar and line | ||||
What was the highest revenue growth rate % seen post the financial crisis? | 11 | 10 | 8 | |
What was the lowest revenue growth rate % seen before the financial crisis? | 5 | 10 | 8 | |
What was the lowest R&D growth rate % seen before the financial crisis? | 10 | 5 | 5 | |
What was the revenue growth rate in 2009? | 4 | 4 | 0 | |
What was the revenue growth rate in 2019? | 11 | 10 | 6 | |
What was the R&D growth rate in 2015? | 7.5 | 0 | -2 | |
Mean Absolute Error (MAE) | 3.25 | 4.92 | ||
Mean Absolute Percentage Error (MAPE) % | 44.70 | 68.23 | ||
bubble chart | ||||
What is the change in disease burden between 2020 and 2040 for nutritional deficiencies? | -20 | -20 | -20 | |
What is the minimum change in disease burden between 2020 and 2040 for infectious diseases? | -5 | -40 | -40 | |
What is the highest increase in disease burden expected from a disease in the ’other diseases’ category? | 60 | 40 | 50 | |
What is the change in disease burden between 2020 and 2040 for neglected tropical disease and malaria? | -30 | -30 | -40 | |
How many bubbles which represents an ’infectious disease’ have a change in disease burden that ranges from -20 to -40? | 3 | 3 | 4 | |
Mean Absolute Error (MAE) | 11 | 15.20 | ||
Mean Absolute Percentage Error (MAPE) % | 146.67 | 176.67 | ||
line on bar 2 | ||||
How many deals happened in 2018? | 30 | 10 | 20 | |
How much funding (in millions) was deployed in 2021? | 1500 | 1500 | 1500 | |
What was the deal count in 2019? | 40 | 40 | 20 | |
How much funding (in millions) was deployed in 2017? | 100 | 10 | 0.5 | |
What was the highest deal count between 2017 and 2022? | 110 | 110 | 120 | |
What was the highest funding (in millions) between 2017 and 2020? | 1450 | 300 | 1000 | |
Mean Absolute Error (MAE) | 210 | 98.25 | ||
Mean Absolute Percentage Error (MAPE) % | 39.33 | 37.16 | ||
tagged dots chart | ||||
What % of survey respondents think Advanced IT will be a widely used skill in the future? | 35 | 26 | 30 | |
What % of survey respondents think Basic IT will be a future skill needed? | 15 | 26 | 30 | |
What percentage of respondents think ’Entrepreneurship’ will be an expected future skill needed? | 27 | 18 | 20 | |
How many categories of skills are included in the limited but growing quadrant? | 3 | 4 | 3 | |
How many skills are in the ’limited and stable’ quadrant of this chart? | 4 | 5 | 5 | |
What percentage of respondents said that ’complex information processing’ will be an expected future skill needed? | 23 | 25 | 30 | |
What percentage of respondents said that ’complex information processing’ is a most widely used skill today? | 22 | 15 | 20 | |
Mean Absolute Error (MAE) | 5.71 | 5.29 | ||
Mean Absolute Percentage Error (MAPE) % | 33.03 | 29.25 | ||
moon chart | ||||
What is the relative applicability of Machine Learning (AI) (as a %) in the negotiation stage of the procurement process? | 100 | 100 | 100 | |
What is the relative applicability of Machine Learning (AI) (as a %) in the invoice checking stage of the procurement process? | 25 | 25 | 25 | |
What is the relative applicability of Automation (as a %) in the payment processing stage of the procurement process? | 100 | 100 | 100 | |
What is the relative applicability of Automation (as a %) in the RFX process stage of the procurement process? | 75 | 25 | 25 | |
What is the relative applicability of cognitive agents (as a %) in the ’Buyer’ stage of the procurement process? | 25 | 25 | 25 | |
Mean Absolute Error (MAE) | 10 | 10 | ||
Mean Absolute Percentage Error (MAPE) % | 13.33 | 13.33 | ||
black line chart | ||||
What was the traction in Sep ’15 (in $ thousands)? | 60 | 55 | 60 | |
What was the traction in Nov ’15 (in $ thousands)? | 75 | 80 | 80 | |
What was the traction in Jul ’15 (in $ thousands)? | 48 | 40 | 60 | |
What was the highest traction between May ’15 and May ’16 (in $ thousands)? | 140 | 130 | 137.3 | |
What was the % increase in traction in the ’Last 30 days’? | 22 | 22 | 22 | |
Mean Absolute Error (MAE) | 5.60 | 3.93 | ||
Mean Absolute Percentage Error (MAPE) % | 7.76 | 6.71 | ||
two figures chart | ||||
What % of respondents say their frequency of use is 1-3 times a month because of airline miles? | 4 | 18 | 25 | |
What % of respondents say their duration of most used credit card is 10 years or more because of low interest rates? | 23 | 10 | 20 | |
What % of respondents say their frequency of use is 1-2 times a week because of unlimited cash back? | 14 | 13 | 10 | |
What is the highest % of respondents who say their duration of most used credit card is 1 year but less than 2 years? | 12 | 28 | 35 | |
What is the lowest % of respondents who say their frequency of use is 7-10 times a week? | 6 | 5 | 5 | |
Mean Absolute Error (MAE) | 9 | 10.40 | ||
Mean Absolute Percentage Error (MAPE) % | 112.73 | 154.99 | ||
seven line chart | ||||
What were the influenza rates in Feb 2016/17? | 28000 | 25000 | 30000 | |
What were the influenza rates in Apr 2014/15? | 8000 | 6000 | 5000 | |
What were the influenza rates in Jan 2015/16? | 10000 | 10000 | 10000 | |
What was the average influenza rate in Mar between 2014-19? | 27000 | 15000 | 20000 | |
Mean Absolute Error (MAE) | 4250 | 3000 | ||
Mean Absolute Percentage Error (MAPE) % | 20.04 | 17.64 | ||
single line chart | ||||
What was the growth projection on 12/29/2003? | 80000 | 150000 | 100000 | |
What was the growth projection on 5/17/2004? | 210000 | 400000 | 300000 | |
What was the actual data on 4/19/2004? | 480000 | 400000 | 102000 | |
What was the highest actual growth between 5/03 - Present? | 900000 | 900000 | 900000 | |
Mean Absolute Error (MAE) | 85000 | 122000 | ||
Mean Absolute Percentage Error (MAPE) % | 48.66 | 36.65 | ||
stacked bar low res | ||||
What were Origin’s ’load and bank guarantee - undrawn’ (in $ million) in 2011? | 800 | 1500 | 400 | |
What were Origin’s ’US private placement’ (in $ million) in 2014? | 180 | 400.0 | 0.0 | |
What were Origin’s ’load and bank guarantee - undrawn’ (in $ million) in 2010? | 90 | 400 | 200 | |
What were Origin’s ’load and bank guarantee - drawn’ (in $ million) in 2012? | 1100 | 1500 | 200 | |
What was Origin’s highest US private placement (in $ million) between 2010 and 2020? | 250 | 1600 | 1000 | |
Mean Absolute Error (MAE) | 596 | 468 | ||
Mean Absolute Percentage Error (MAPE) % | 226.11 | 130.81 | ||
three lines chart | ||||
What was the total return to shareholders by non resilient companies in 2015? | 120 | 120 | 150 | |
What was the total return to shareholders by resilient companies in 2013? | 230 | 240 | 175 | |
What was the total return to shareholders by resilient companies at the end of recovery phase? | 150 | 150 | 200 | |
What was the total return to shareholders by S&P 500 companies at the end of growth phase? | 210 | 250 | 200 | |
What was the total return to shareholders by S&P 500 companies at the end of downturn phase? | 75 | 70 | 100 | |
What was the total return to shareholders by resilient companies in 2014? | 250 | 250 | 200 | |
What was the total return to shareholders by non-resilient companies in 2010? | 80 | 90 | 100 | |
Mean Absolute Error (MAE) | 9.29 | 34.29 | ||
Mean Absolute Percentage Error (MAPE) % | 6.08 | 23.62 | ||
stacked bar and pie chart | ||||
What is the lead supply ’Today’ in Type 1 channel? | 35000 | 50000 | 50000 | |
What is the lead supply in ’12 months’ in Type 1 channel? | 90000 | 100000 | 200000 | |
What is the lead supply in ’12 months’ in Type 4 channel? | 25000 | 30 | 50000 | |
What is the lead supply in ’Today’ in Type 4 channel? | 10000 | 20 | 50000 | |
What % of loan distribution is with Lender 4? | 9 | 18 | 5 | |
What % of loan distribution is with Lender 2? | 34 | 34 | 34 | |
What % of loan distribution is with ’Other’ Lender? | 5 | 9 | 5 | |
Mean Absolute Error (MAE) | 8566.14 | 27143.43 | ||
Mean Absolute Percentage Error (MAPE) % | 61.95 | 101.36 | ||
dots chart | ||||
What’s the percentage of the population above empowerment line in Israel in 2022? | 73 | 70 | 75 | |
What’s the GDP per capita 2022 in the US? (in $ thousands) | 75 | 65 | 70 | |
What’s the GDP per capita 2022 in Ireland (in $ thousands)? | 105 | 90 | 80 | |
What’s the percentage of the population above empowerment line in Ireland in 2022? | 73 | 70 | 75 | |
What’s the highest percentage of the population above empowerment line for a country with empowerment line set at $12 PPP floor? | 40 | 50 | 100 | |
Mean Absolute Error (MAE) | 8.20 | 18.80 | ||
Mean Absolute Percentage Error (MAPE) % | 12.17 | 37.19 | ||
comparison bar charts | ||||
What was the instrument handling performance score when using Osso VR Training? | 4 | 3 | 5 | |
What was the total performance score when using Standard training? | 7.5 | 8 | 6 | |
What was the knowledge of instruments performance score when using Osso VR Training? | 3.5 | 2 | 5 | |
What was the time and motion performance score when using standard training? | 2.5 | 1 | 2 | |
What was the total performance score when using Osso VR training? | 17.5 | 19 | 20 | |
Mean Absolute Error (MAE) | 1.20 | 1.40 | ||
Mean Absolute Percentage Error (MAPE) % | 28.62 | 24.43 | ||
simple line on bar | ||||
What was the number of mocap systems (in thousands) in the market in 2024? | 75 | 87.5 | 80 | |
What was the number of mocap systems (in thousands) in the market in 2022? | 15.0 | 20 | 20 | |
What was the number of assets created by users (in million) per year in 2022? | 22.5 | 0.02 | 22.5 | |
What was the number of assets created by users (in million) per year in 2023? | 44 | 40 | 45 | |
What was the highest number of assets created by users (in million) between 2021 and 2024? | 80 | 67.5 | 80 | |
Mean Absolute Error (MAE) | 11.30 | 2.20 | ||
Mean Absolute Percentage Error (MAPE) % | 34.93 | 8.45 | ||
range line chart | ||||
What percentage of respondents prefer in-store experience during the purchase step of shopping journey for the furniture category? | 78 | 39 | 80 | |
What percentage of respondents prefer in-store experience during the discovery step of shopping journey for the computer and electronics category? | 30 | 57 | 30 | |
What percentage of respondents prefer in-store experience during the pickup step of shopping journey for the furniture category? | 35 | 38 | 30 | |
What is the average in-store preference (percentage of respondents) during the trial step of the shopping journey? | 80 | 50 | 60 | |
What is the lowest in-store preference % in the pickup step of the shopping journey? | 35 | 36 | 30 | |
What is the highest in-store shopping preference % in the trial step of the shopping journey? | 85 | 80 | 80 | |
Mean Absolute Error (MAE) | 17.50 | 6.17 | ||
Mean Absolute Percentage Error (MAPE) % | 32.47 | 10.34 |
References
- [1] B. Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs, arXiv, 2023. https://arxiv.org/pdf/2305.18641.
- [2] D. Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis, arXiv, 2023. https://arxiv.org/pdf/2308.11224.
- [3] E. Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study, arXiv, 2023. https://arxiv.org/pdf/2305.13062.
- [4] F. Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models - A Survey, arXiv, 2024. https://arxiv.org/pdf/2404.01869.
- [5] Y. UniChart, "A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning", arXiv, 2023. https://arxiv.org/pdf/2305.14761.
- [6] X. Zhang, Y. Xie, H. Zhang, et al. "A Survey on Evaluation of Large Language Models." arXiv preprint arXiv:2307.03109, 2023. Available: https://arxiv.org/pdf/2307.03109