ChatBCG: Can AI Read Your Slide Deck?

Nikita Singh
niks19@seas.upenn.edu

Rob Balian
Reprompt AI
rob@repromptai.com

Lukas Martinelli
Reprompt AI
lukas@repromptai.com

Abstract

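Multimodal models such as GPT4o and Gemini Flash excel at reasoning and summarization tasks, approaching human-level performance. However, we find that these models underperform humans when asked to do very specific "reading and estimation" tasks, particularly in the context of visual charts in business decks. This paper evaluates the accuracy of GPT 4o and Gemini Flash-1.5 in answering straightforward questions about data on labeled charts (where data is clearly annotated on the chart) and unlabeled charts (where data is not explicitly annotated and must be inferred from the X and Y axes).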

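On labeled charts, we find that GPT 4o and Gemini Flash are consistently inaccurate on specific chart types (multiple figures in one chart, stacked charts, waterfall charts). Human error rates are estimated to be below 5%, while GPT-4o and Gemini Flash have error rates of 16% and 14% respectively (measured as the percentage of data points that did not exactly match the source of truth). Both models frequently make similar mistakes, such as misreading a "3" as an "8" or reporting a negative number as positive. Across charts, neither model consistently outperformed the other.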

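Similarly, on unlabeled charts, GPT and Gemini share similar failure points. The models were tasked with estimating numbers on unlabeled charts by reading the X and Y axes. As expected, exact matches were rare: the share of answers that did not exactly match was 79% for Gemini and 83% for GPT 4o. Here, we measured the magnitude of each model's error relative to the (human-read) source of truth. The errors were large, with an average deviation of 53% for Gemini Flash-1.5 and 55% for GPT-4o (measured as mean absolute percentage error; see Methodology for the formula), compared with an estimated human error of 10-20%. Most of the deviation came from small estimation errors, but there were also larger deviations where a model misread a label or number. For example, GPT 4o once misread the year 2015 as 2009 and estimated the number for the wrong year.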

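We conclude that these models cannot currently read a deck accurately end to end if it contains any complex or unlabeled charts. Even if a user created a deck containing only labeled charts, the models would read only 7-8 of 15 labeled charts perfectly end to end.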

1 Introduction

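With the advanced vision capabilities of GPT-4o and Gemini Flash, an important question arises about how accurate these capabilities are in real business applications. Our working assumption is that multimodal models are good at reading and summarizing charts: given an image of a slide, they summarize its key insights well, often including relevant data points.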

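Existing research on this question has evaluated the efficacy of LLMs in parsing tables [3], concluding that LLM performance is highly sensitive to the input prompt. Other work has evaluated the ability of LLMs to reason about and read mathematical graphs [2], finding that GPT models outperform the alternatives.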

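This paper explores whether multimodal models perform well on a variant of this skill: answering direct questions that require the model to pick out numbers from a slide. We test this by asking the models questions about data printed directly on the chart (if labeled) or asking them to estimate data points from the chart (if unlabeled). We ensured that the models did not need to perform any mathematical calculations. We then measure the accuracy of the responses across chart types to answer the following specific questions: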

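• How accurately do multimodal models with advanced vision capabilities read data from labeled charts? Does one model have a consistent accuracy advantage over the other?
• How accurately do models estimate numerical data from unlabeled charts? On average, how "wrong" are their estimates?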

In the next section, we dive into error rates by chart type, split into two categories: labeled and unlabeled.

2 How accurately do multimodal models read charts?

2.1 Methodology

We sampled 31 charts and divided them into 2 broad categories:

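• Labeled (15 charts): This category includes simple bar charts, line charts, and multi-bar charts where every data point is explicitly printed on the chart. We also included some more complex labeled charts such as waterfall charts, stacked charts, bubble charts, and connected bar graphs.
• Unlabeled (16 charts): This category focuses on charts where data points are not explicitly printed and must be estimated by "reading off the position relative to the X and Y axes". It does not include charts that have no scale at all. The dataset includes simpler unlabeled charts such as bar charts, line charts, mixed charts, and bubble charts, as well as some more complex ones such as moon charts and dot charts.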

For each chart, we created a dataset of questions. The questions were limited to 3 types:

• Identifying a specific data point
• Identifying the maximum/minimum data point
• Counting the number of data points

The goal of these questions is to test the ability to read and interpret data directly from a chart, without any intensive calculation.
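One way to hold such a question set in code is sketched below. This is only an illustration: the class names, field names, and the assignment of question types to the two sample questions are assumptions, not part of the original dataset. The two sample questions and their answers are taken from the "line on bar 2" chart in the appendix.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    """The three question types listed above (identifiers are hypothetical)."""
    SPECIFIC_POINT = "identify a specific data point"
    MAX_MIN = "identify the maximum/minimum data point"
    COUNT = "count the number of data points"


@dataclass(frozen=True)
class ChartQuestion:
    chart_name: str        # e.g. "line on bar 2"
    labeled: bool          # True if the values are printed on the chart
    question: str
    qtype: QuestionType
    correct_answer: float  # human-read source of truth


# Two questions from the "line on bar 2" chart (unlabeled category).
questions = [
    ChartQuestion("line on bar 2", False,
                  "How many deals happened in 2018?",
                  QuestionType.SPECIFIC_POINT, 30),
    ChartQuestion("line on bar 2", False,
                  "What was the highest deal count between 2017 and 2022?",
                  QuestionType.MAX_MIN, 110),
]

for q in questions:
    print(q.chart_name, "|", q.qtype.name, "|", q.correct_answer)
```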

We then evaluated the responses of both models on the following metrics:

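• Match Rate %: This metric is designed for labeled charts, where the model reads data printed on the chart, but it is reported for both categories. It is calculated as:

$MatchRate=\left(\frac{\text{Number of perfectly matched answers}}{\text{Total number of questions}}\right)\times 100$
• Mean Absolute Error (MAE): This metric is reported for unlabeled charts, where the model "estimates" numbers from the chart. It measures the average magnitude of the errors in a set of predictions, without considering their direction. It is calculated as:

$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|A_{i}-P_{i}\right|$

where:
- $n$ is the number of data points,
- $A_{i}$ is the actual value of the $i$-th data point,
- $P_{i}$ is the predicted value of the $i$-th data point.
• Mean Absolute Percentage Error (MAPE): This metric is reported for unlabeled charts, where the model "estimates" numbers from the chart. It can be read as how far the model's answer is from the correct answer, normalized to a percentage. It is calculated as:

$MAPE=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{A_{i}-P_{i}}{A_{i}}\right|\times 100$

where:
- $n$ is the number of data points,
- $A_{i}$ is the actual value of the $i$-th data point,
- $P_{i}$ is the predicted value of the $i$-th data point.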

These metrics are then computed at an aggregate level across all charts and reported separately for the labeled and unlabeled datasets.
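The three metrics can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the study; the function names are assumptions. The sample data is the "line on bar 2" chart from the appendix, for which the appendix reports a GPT MAE of 98.25 and a GPT MAPE of 37.16.

```python
from typing import Sequence


def _check(actual: Sequence[float], predicted: Sequence[float]) -> None:
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def match_rate(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Percentage of answers that exactly equal the source of truth."""
    _check(actual, predicted)
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual) * 100


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error, in the units of the chart."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error; undefined if any actual value is 0."""
    _check(actual, predicted)
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual) * 100


# "line on bar 2" chart: correct answers and model answers from the appendix.
actual = [30, 1500, 40, 100, 110, 1450]
gpt = [20, 1500, 20, 0.5, 120, 1000]
gemini = [10, 1500, 40, 10, 110, 300]

print(mae(actual, gpt), round(mape(actual, gpt), 2))        # 98.25 37.16
print(mae(actual, gemini), round(mape(actual, gemini), 2))  # 210.0 39.33
print(match_rate(actual, gemini))                           # 50.0
```

Note that MAPE divides by the actual value, so a small true value can produce a very large percentage error even when the absolute error is modest.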

2.1.1 Methodology walkthrough for a labeled chart

For example, for the labeled chart in Figure 1 (below), we generated questions and measured the match % as shown in Table 1.

Table 1: Comparison of GPT and Gemini answers for a labeled chart

Question	Correct Answer	GPT Answer	Gemini Answer
what % of the employee workforce is white in FY23?	55	55	55
what % of the employee workforce is female in FY21?	49	49	51
what % of the employee workforce is ’two or more races’ in FY22?	3	3	8
what % of the employee workforce elected not to provide their race in FY21?	4	4	4
what % of the employee workforce is Asian in FY23?	22	22	7
Match %		100%	60%

2.1.2 Methodology walkthrough for an unlabeled chart

For an unlabeled chart (see Figure 2 below), we generated questions and estimated the MAPE as shown in Table 2.

Table 2: Comparison of GPT and Gemini answers for an unlabeled chart

Question	Correct Answer	GPT Answer	Gemini Answer
How many deals happened in 2018?	30	20	10
How much funding (in millions) was deployed in 2021?	1500	1500	1500
What was the deal count in 2019?	40	20	40
How much funding (in millions) was deployed in 2017?	100	0.5	10
What was the highest deal count between 2017 and 2022?	110	120	110
What was the highest funding (in millions) between 2017 and 2020?	1450	1000	300
MAPE %		37%	39%
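As a check on the GPT column of Table 2, substituting its six rows into the MAPE formula from the Methodology gives (individual terms rounded to three decimals):

$MAPE_{GPT}=\frac{100}{6}\left(\frac{|30-20|}{30}+\frac{|1500-1500|}{1500}+\frac{|40-20|}{40}+\frac{|100-0.5|}{100}+\frac{|110-120|}{110}+\frac{|1450-1000|}{1450}\right)\approx\frac{100}{6}\left(0.333+0+0.500+0.995+0.091+0.310\right)\approx 37\%$

The same substitution for the Gemini column gives approximately 39%. In the GPT case, the single 2017 funding question (0.5 answered against a true value of 100) contributes 0.995 of the total of about 2.23, so one misread value accounts for a large share of the chart's MAPE.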

2.2 Summary of findings

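Using the methodology above, the tables below report the overall match % and MAPE % for labeled and unlabeled charts across the sample of 31 charts: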

Table 3: Perfect match % for labeled charts

Chart Type	GPT Match %	Gemini Match %
Labeled Charts	84%	86%

Table 4: MAPE % for unlabeled charts

Chart Type	GPT MAPE %	Gemini MAPE %
Unlabeled Charts	55%	53%

2.3 Labeled charts: what is the match rate across charts?

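Across 15 charts and a total of 71 questions, GPT 4o read 8 charts perfectly end to end and Gemini Flash-1.5 read 7. On the charts with errors, the models did not get every question wrong. Both models made large errors on a few data points, usually because they misread a label. Because these are not estimation errors, the error ranges are erratic, as shown in Table 5. Both models struggled with charts that contain multiple figures.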

Table 5: Chart-level breakdown for the labeled category: GPT and Gemini match % and error range

Chart Name	Gemini Match %	GPT Match %	Gemini Error Range	GPT Error Range
stacked bar chart	100%	80%	(0, 0)	(0, 5)
bar and table chart	100%	100%	(0, 0)	(0, 0)
piechart	100%	100%	(0, 0)	(0, 0)
simple table	100%	100%	(0, 0)	(0, 0)
icons on chart	100%	100%	(0, 0)	(0, 0)
stacked piechart	100%	100%	(0, 0)	(0, 0)
world map chart	100%	100%	(0, 0)	(0, 0)
bar chart diagonal labels	100%	75%	(0, 0)	(0, 2)
connected bar graphs	86%	100%	(0, 20)	(0, 0)
horizontal stacked bar chart	80%	100%	(0, 15)	(0, 0)
positive-negative bar charts	80%	80%	(0, 878)	(0, 178)
waterfall chart	80%	60%	(0, 194)	(0, 882)
eight multi-bar charts	75%	50%	(0, 1)	(0, 22)
two figures: stacked multi-bar	67%	67%	(0, 4)	(0, 4)
two figures - multi-bars	20%	40%	(0, 52)	(0, 46)
Overall	86%	84%
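The text does not define the error range columns explicitly. The sketch below assumes they are the smallest and largest absolute difference between the correct answer and the model's answer within a chart; on the "positive-negative bar charts" rows from the appendix, this assumption reproduces the values in Table 5. The function name is illustrative.

```python
def error_range(actual, predicted):
    """Return (smallest, largest) absolute error across a chart's questions.

    Assumed definition: the paper does not state how the range is computed.
    """
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return min(errors), max(errors)


# "positive-negative bar charts": correct answers and model answers (appendix 4.1).
actual = [-482, 580, 40, 1865, -880]
gemini = [-482, -298, 40, 1865, -880]
gpt = [-482, 580, 218, 1865, -880]

print(error_range(actual, gemini))  # (0, 878)
print(error_range(actual, gpt))     # (0, 178)
```

Both models answer four of the five questions exactly, so the lower bound is 0, while a single misread value sets the upper bound.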

GPT 4o failure points

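In some cases, 4o misinterpreted which data belonged to a given label in stacked charts. In "two figures: stacked multi-bar", when asked for the number of Athleta stores in 2012, it returned the number of Microsoft stores. Similarly, in "eight multi-bar charts", it incorrectly returned Japan's diabetes rate instead of its obesity rate.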

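In other cases, 4o misidentified numbers on lower-resolution charts. For example, in "stacked bar chart", when asked for the % of public sector publications published by KPMG, GPT 4o misread "3" as "8", whereas Gemini did not make this mistake.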

Gemini failure points

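As with 4o, Gemini's errors stem from misinterpreting labels, but they also include an element of misinterpreting the question, which GPT 4o does not often exhibit. For example, in "two figures - multi-bars", when asked what % of financing sources came from private investment (other than EPR), it answered from the wrong bar chart. 4o, on the other hand, answered from the correct chart but from the wrong label within it. Although outside the scope of these error rates, we also noticed that 4o was better than Gemini at answering some reasoning/calculation questions.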

2.4 Unlabeled charts: what is the magnitude of error across charts?

Across 87 questions on the 16 unlabeled charts, GPT 4o and Gemini Flash had mean absolute percentage errors of roughly 55% and 53% respectively (Table 6).

Table 6: Chart-level breakdown for the unlabeled category: GPT and Gemini MAPE

Chart Name	Gemini MAPE %	GPT MAPE %
stacked bar low res	226	131
bubble chart	147	177
two figures chart	113	155
stacked bar and pie chart	62	101
single line chart	49	37
bar and line	45	68
line on bar 2	39	37
simple line on bar	35	8
tagged dots chart	33	29
range line chart	32	10
comparison bar charts	29	24
seven line chart	20	18
moon chart	13	13
dots chart	12	37
black line chart	8	7
three line chart	6	24
Overall	52.79	54.64
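The text does not spell out how the per-chart figures are combined into the Overall row. One reading, assumed here, is that errors are pooled across all questions, which is equivalent to weighting each chart's MAPE by its number of questions. Using the per-chart question counts and MAPE values from appendix 4.2, this reproduces the Overall row, whereas an unweighted mean of the chart MAPEs does not. Function and variable names are illustrative.

```python
# chart name: (number of questions, Gemini MAPE %, GPT MAPE %), from appendix 4.2
per_chart = {
    "bar and line": (6, 44.70, 68.23),
    "bubble chart": (5, 146.67, 176.67),
    "line on bar 2": (6, 39.33, 37.16),
    "tagged dots chart": (7, 33.03, 29.25),
    "moon chart": (5, 13.33, 13.33),
    "black line chart": (5, 7.76, 6.71),
    "two figures chart": (5, 112.73, 154.99),
    "seven line chart": (4, 20.04, 17.64),
    "single line chart": (4, 48.66, 36.65),
    "stacked bar low res": (5, 226.11, 130.81),
    "three lines chart": (7, 6.08, 23.62),
    "stacked bar and pie chart": (7, 61.95, 101.36),
    "dots chart": (5, 12.17, 37.19),
    "comparison bar charts": (5, 28.62, 24.43),
    "simple line on bar": (5, 34.93, 8.45),
    "range line chart": (6, 32.47, 10.34),
}


def pooled_mape(col: int) -> float:
    """Question-weighted mean of per-chart MAPE (col 1 = Gemini, col 2 = GPT)."""
    total_questions = sum(row[0] for row in per_chart.values())
    return sum(row[0] * row[col] for row in per_chart.values()) / total_questions


def unweighted_mape(col: int) -> float:
    """Plain average of per-chart MAPE, ignoring question counts."""
    return sum(row[col] for row in per_chart.values()) / len(per_chart)


print(round(pooled_mape(1), 2), round(pooled_mape(2), 2))  # 52.79 54.64
print(round(unweighted_mape(1), 2), round(unweighted_mape(2), 2))
```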

Observations

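Both Gemini and GPT 4o struggle with charts that contain multiple unlabeled figures, with error rates above 100% on the "two figures chart" (113% and 155% respectively). This is driven by high-magnitude errors on many or all of the questions for that chart. Another common failure point is stacked charts, where estimating a middle value requires estimating all the labels in the chart (here, error rates were as high as 85% on the high-resolution chart and 230% on the low-resolution chart).
Both models also struggle with stacked bar charts, where estimating a number may require isolating a sub-bar.
For example, in "stacked bar and pie chart", both models responded with the total value of the bar rather than the value of the specific sub-bar (50,000 against a correct answer of 35,000).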
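To illustrate how a chart's MAPE can exceed 100%, the sketch below breaks the GPT 4o result for the "two figures chart" into per-question percentage errors, using the rows from appendix 4.2. The helper name is an assumption; the formula is the MAPE definition from the Methodology applied one question at a time.

```python
def pct_errors(actual, predicted):
    """Absolute percentage error for each question (actual values must be non-zero)."""
    return [abs((a - p) / a) * 100 for a, p in zip(actual, predicted)]


# "two figures chart": correct answers and GPT 4o answers (appendix 4.2).
actual = [4, 23, 14, 12, 6]
gpt = [25, 20, 10, 35, 5]

errs = pct_errors(actual, gpt)
print([round(e, 1) for e in errs])      # [525.0, 13.0, 28.6, 191.7, 16.7]
print(round(sum(errs) / len(errs), 2))  # 154.99
```

Two of the five answers are off by more than 100% of the true value, and the largest of these (25 against a true value of 4) contributes 525% on its own, which is why the chart-level figure ends up above 150%.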

Both models performed best on line charts, with a deviation as low as 6% on a complex three-line chart.

A detailed breakdown of questions and errors by chart is given in the appendix.

3 Conclusion

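While GPT-4o and Gemini Flash 1.5 both demonstrate many advanced capabilities in reading charts, our evaluation reveals consistent limitations in their current performance. Although better than other alternatives, these models still require human oversight to reach an acceptable level of accuracy. On labeled charts, the models have an average error rate of 15%, which may be unsuitable for high-stakes business applications. Performance on unlabeled charts is especially inconsistent, with error rates exceeding 100% on more complex visuals. When evaluated on a comprehensive set of 31 charts representative of a typical business deck, the models read only 7-8 charts with 100% match accuracy (all questions correct). Therefore, for any use case that demands high precision, these models are not yet ready to run without human intervention.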

4 Appendix

4.1 Labeled charts: questions and model answers

Chart Type	Question	Correct Answer	Gemini Flash Answer	GPT 4o Answer
two figures: stacked multi-bar
	How many stores did Warby Parker open by 2013?	13	13	13
	In 2012, how many stores did Athleta have?	35	35	31
	In 2013, how many stores did Boston Proper have?	4	8	4
	In 2009, how many stores did Microsoft have?	2	2	0
	What is the % of capex to store related investments in 2012?	71	71	71
	What was the compound annual growth rate of the % of capex to store related investments from 2011 to 2013?	6	4	6
	Perfect Match %		67	67
stacked bar chart
	What % of retail and distribution publications did PwC publish from 2013-15?	6	6	6
	How many industry publications did KPMG have from 2013-15?	626	626	626
	What % of retail and distribution publications did EY publish from 2013-15?	4	4	4
	What % of public sector publications did Deloitte publish from 2013-15?	9	9	9
	What % of public sector publications did KPMG publish from 2013-15?	3	3	8
	Perfect Match %		100	80
connected bar graphs
	What is the projected total global fintech revenue (in $ billion) for 2030?	1500	1500	1500
	How much of the global fintech revenue in 2021 came from the banking sector?	225	245	225
	By how many times is the global fintech revenue expected to grow from 2021 to 2030?	6	6	6
	What is the projected fintech penetration of banking revenues (%) for 2030?	13	13	13
	What was the fintech penetration of banking valuations (%) in 2021?	9	9	9
	What is the projected revenue for the insurance sector in 2030 in billion dollars?	200	200	200
	What was the total global fintech revenue in 2021 in billion dollars?	245	245	245
	Perfect Match %		86	100
eight multi-bar charts
	What is the prevalence of high blood pressure in France?	28.5	28.5	28.5
	How much more prevalent is diabetes in France compared to India?	2.7	2.2	3.9
	What is the prevalence of obesity in Brazil?	22.3	22.3	22.3
	What is the prevalence of obesity in Japan?	4.4	4.4	26.7
	Perfect Match %		75	50
two figures - multi-bars
	What percentage of financing sources that supported the implementation of the solution came from private investment (other than EPR)?	42	0	16
	What is the percentage of governmental funding that supported the implementation of the solution?	16	21	16
	What percentage of financing for scaling the solution came from blended financing?	21	0	21
	How much did philanthropic funding contribute to the scaling of the solution in percentage terms?	21	21	0
	What is the combined percentage of multilateral funding for both the implementation and scaling of the solution?	11	63	57
	Perfect Match %		20	40
bar and table chart
	What is the overall unbanked percentage?	27	27	27
	What is the cash usage in APAC?	59	59	59
	How many underbanked adults in Europe?	357	357	357
	How many more unbanked adults in APAC than in Europe?	766	766	766
	Perfect Match %		100	100
piechart
	What percentage of TMT executives are ready to disclose Scope 1 emissions?	69	69	69
	What percentage of TMT executives are ready to disclose all other indirect emissions that occur in the company’s value chain?	34	34	34
	What percentage of executives overall are ready to disclose Scope 3 emissions?	37	37	37
	What is the difference in the % of TMT executives willing to disclose Scope 2 emissions and % of executives overall willing to disclose Scope 2 emissions?	7	7	7
	Perfect Match %		100	100
horizontal stacked bar chart
	What % of the employee workforce is white in FY23?	55	55	55
	What % of the employee workforce is female in FY21?	49	49	49
	What % of the employee workforce is ’two or more races’ in FY22?	3	3	3
	What % of the employee workforce elected not to provide their race in FY21?	4	4	4
	What % of the employee workforce is Asian in FY23?	22	7	22
	Perfect Match %		80	100
simple table
	What is the trades per month in market C?	10	10	10
	What are the net proceeds in market A?	46	46	46
	What is the price in Market B?	48	48	48
	What is the annual volume in market B?	12000	12000	12000
	What is the net proceed in market B?	43	43	43
	Perfect Match %		100	100
icons on charts
	What was the NPS of Netflix?	68	68	68
	What was the NPS of Amazon?	62	62	62
	What was the NPS of Spotify?	54	54	54
	What was the NPS of Life?	26	26	26
	What was the NPS of Apple?	68	68	68
	Perfect Match %		100	100
positive-negative bar charts
	What is the change in medical claims spent per participant (in $) for ’Hello Heart Users’ in the surgery/inpatient service category?	-482	-482	-482
	What is the change in medical claims spent per participant (in $) for ’Matched Control’ group in the diagnostics service category?	580	-298	580
	What is the change in medical claims spent per participant (in $) for ’Matched Control’ group in the physician visit service category?	40	40	218
	What is the reduction in total medical costs per participant per year (in $)?	1865	1865	1865
	What is the total change in medical claims spent per participant (in $) for ’Hello Heart Users’?	-880	-880	-880
	Perfect Match %		80	80
waterfall chart
	What is the starting IH09? (in $ millions)	1911	1911	1911
	What is the deduction for investing activities (in $ millions)?	97	-97	-97
	What is the addition for cash capex (in $ millions)?	441	441	-441
	What is the pension contribution (in $ millions)?	-187	-187	-187
	What is the cash from operations amount (in $ millions)?	285	285	285
	Perfect Match %		80	60
world map chart
	What is the % of women in the Americas?	49	49	49
	What is the number of people in Asia Pacific?	56386	56386	56386
	What is the number of people in EMA?	142368	142368	142368
	What is the total number of member firms?	143	143	143
	What is the overall % of women globally?	48.5	48.5	48.5
	What is the % of women in Asia-Pacific?	54	54	54
	Perfect Match %		100	100
stacked piechart
	What is the % share of digital engagement in the personal care category?	17	17	17
	What is the % share of digital engagement in the computer software category?	98	98	98
	What is the % share of digital engagement in the food and drink category?	3	3	3
	What % of middle east and africa uses digital banking?	17	17	17
	What is the % share of digital engagement in the consumer banking category?	58	58	58
	Perfect Match %		100	100
bar chart diagonal labels
	What is the accuracy % for resort in Graph Layer = 2	86.7	86.7	88.4
	What is the accuracy % for viaduct in Graph Layer = 1	84	84	84
	What is the accuracy % for forest in Graph Layer = 3	84.9	84.9	84.9
	What is the accuracy % for forest in Graph Layer = 1	83.9	83.9	83.9

	Perfect Match %		100	75

4.2 Unlabeled charts: questions and model answers

Chart Type	Question	Correct Answer	Gemini Flash Answer	GPT 4o Answer
bar and line
	What was the highest revenue growth rate % seen post the financial crisis?	11	10	8
	What was the lowest revenue growth rate % seen before the financial crisis?	5	10	8
	What was the lowest R&D growth rate % seen before the financial crisis?	10	5	5
	What was the revenue growth rate in 2009?	4	4	0
	What was the revenue growth rate in 2019?	11	10	6
	What was the R&D growth rate in 2015?	7.5	0	-2
	Mean Absolute Error (MAE)		3.25	4.92
	Mean Absolute Percentage Error (MAPE) %		44.70	68.23
bubble chart
	What is the change in disease burden between 2020 and 2040 for nutritional deficiencies?	-20	-20	-20
	What is the minimum change in disease burden between 2020 and 2040 for infectious diseases?	-5	-40	-40
	What is the highest increase in disease burden expected from a disease in the ’other diseases’ category?	60	40	50
	What is the change in disease burden between 2020 and 2040 for neglected tropical disease and malaria?	-30	-30	-40
	How many bubbles which represents an ’infectious disease’ have a change in disease burden that ranges from -20 to -40?	3	3	4

	Mean Absolute Error (MAE)		11	15.20
	Mean Absolute Percentage Error (MAPE) %		146.67	176.67
line on bar 2
	How many deals happened in 2018?	30	10	20
	How much funding (in millions) was deployed in 2021?	1500	1500	1500
	What was the deal count in 2019?	40	40	20
	How much funding (in millions) was deployed in 2017?	100	10	0.5
	What was the highest deal count between 2017 and 2022?	110	110	120
	What was the highest funding (in millions) between 2017 and 2020?	1450	300	1000

	Mean Absolute Error (MAE)		210	98.25
	Mean Absolute Percentage Error (MAPE) %		39.33	37.16
tagged dots chart
	What % of survey respondents think Advanced IT will be a widely used skill in the future?	35	26	30
	What % of survey respondents think Basic IT will be a future skill needed?	15	26	30
	What percentage of respondents think ’Entrepreneurship’ will be an expected future skill needed?	27	18	20
	How many categories of skills are included in the limited but growing quadrant?	3	4	3
	How many skills are in the ’limited and stable’ quadrant of this chart?	4	5	5
	What percentage of respondents said that ’complex information processing’ will be an expected future skill needed?	23	25	30
	What percentage of respondents said that ’complex information processing’ is a most widely used skill today?	22	15	20

	Mean Absolute Error (MAE)		5.71	5.29
	Mean Absolute Percentage Error (MAPE) %		33.03	29.25
moon chart
	What is the relative applicability of Machine Learning (AI) (as a %) in the negotiation stage of the procurement process?	100	100	100
	What is the relative applicability of Machine Learning (AI) (as a %) in the invoice checking stage of the procurement process?	25	25	25
	What is the relative applicability of Automation (as a %) in the payment processing stage of the procurement process?	100	100	100
	What is the relative applicability of Automation (as a %) in the RFX process stage of the procurement process?	75	25	25
	What is the relative applicability of cognitive agents (as a %) in the ’Buyer’ stage of the procurement process?	25	25	25

	Mean Absolute Error (MAE)		10	10
	Mean Absolute Percentage Error (MAPE) %		13.33	13.33
black line chart
	What was the traction in Sep ’15 (in $ thousands)?	60	55	60
	What was the traction in Nov ’15 (in $ thousands)?	75	80	80
	What was the traction in Jul ’15 (in $ thousands)?	48	40	60
	What was the highest traction between May ’15 and May ’16 (in $ thousands)?	140	130	137.3
	What was the % increase in traction in the ’Last 30 days’?	22	22	22

	Mean Absolute Error (MAE)		5.60	3.93
	Mean Absolute Percentage Error (MAPE) %		7.76	6.71
two figures chart
	What % of respondents say their frequency of use is 1-3 times a month because of airline miles?	4	18	25
	What % of respondents say their duration of most used credit card is 10 years or more because of low interest rates?	23	10	20
	What % of respondents say their frequency of use is 1-2 times a week because of unlimited cash back?	14	13	10
	What is the highest % of respondents who say their duration of most used credit card is 1 year but less than 2 years?	12	28	35
	What is the lowest % of respondents who say their frequency of use is 7-10 times a week?	6	5	5

	Mean Absolute Error (MAE)		9	10.40
	Mean Absolute Percentage Error (MAPE) %		112.73	154.99
seven line chart
	What were the influenza rates in Feb 2016/17?	28000	25000	30000
	What were the influenza rates in Apr 2014/15?	8000	6000	5000
	What were the influenza rates in Jan 2015/16?	10000	10000	10000
	What was the average influenza rate in Mar between 2014-19?	27000	15000	20000

	Mean Absolute Error (MAE)		4250	3000
	Mean Absolute Percentage Error (MAPE) %		20.04	17.64
single line chart
	What was the growth projection on 12/29/2003?	80000	150000	100000
	What was the growth projection on 5/17/2004?	210000	400000	300000
	What was the actual data on 4/19/2004?	480000	400000	102000
	What was the highest actual growth between 5/03 - Present?	900000	900000	900000

	Mean Absolute Error (MAE)		85000	122000
	Mean Absolute Percentage Error (MAPE) %		48.66	36.65
stacked bar low res
	What were Origin’s ’load and bank guarantee - undrawn’ (in $ million) in 2011?	800	1500	400
	What were Origin’s ’US private placement’ (in $ million) in 2014?	180	400.0	0.0
	What were Origin’s ’load and bank guarantee - undrawn’ (in $ million) in 2010?	90	400	200
	What were Origin’s ’load and bank guarantee - drawn’ (in $ million) in 2012?	1100	1500	200
	What was Origin’s highest US private placement (in $ million) between 2010 and 2020?	250	1600	1000

	Mean Absolute Error (MAE)		596	468
	Mean Absolute Percentage Error (MAPE) %		226.11	130.81
three lines chart
	What was the total return to shareholders by non resilient companies in 2015?	120	120	150
	What was the total return to shareholders by resilient companies in 2013?	230	240	175
	What was the total return to shareholders by resilient companies at the end of recovery phase?	150	150	200
	What was the total return to shareholders by S&P 500 companies at the end of growth phase?	210	250	200
	What was the total return to shareholders by S&P 500 companies at the end of downturn phase?	75	70	100
	What was the total return to shareholders by resilient companies in 2014?	250	250	200
	What was the total return to shareholders by non-resilient companies in 2010?	80	90	100

	Mean Absolute Error (MAE)		9.29	34.29
	Mean Absolute Percentage Error (MAPE) %		6.08	23.62
stacked bar and pie chart
	What is the lead supply ’Today’ in Type 1 channel?	35000	50000	50000
	What is the lead supply in ’12 months’ in Type 1 channel?	90000	100000	200000
	What is the lead supply in ’12 months’ in Type 4 channel?	25000	30	50000
	What is the lead supply in ’Today’ in Type 4 channel?	10000	20	50000
	What % of loan distribution is with Lender 4?	9	18	5
	What % of loan distribution is with Lender 2?	34	34	34
	What % of loan distribution is with ’Other’ Lender?	5	9	5

	Mean Absolute Error (MAE)		8566.14	27143.43
	Mean Absolute Percentage Error (MAPE) %		61.95	101.36
dots chart
	What’s the percentage of the population above empowerment line in Israel in 2022?	73	70	75
	What’s the GDP per capita 2022 in the US? (in $ thousands)	75	65	70
	What’s the GDP per capita 2022 in Ireland (in $ thousands)?	105	90	80
	What’s the percentage of the population above empowerment line in Ireland in 2022?	73	70	75
	What’s the highest percentage of the population above empowerment line for a country with empowerment line set at $12 PPP floor?	40	50	100

	Mean Absolute Error (MAE)		8.20	18.80
	Mean Absolute Percentage Error (MAPE) %		12.17	37.19
comparison bar charts
	What was the instrument handling performance score when using Osso VR Training?	4	3	5
	What was the total performance score when using Standard training?	7.5	8	6
	What was the knowledge of instruments performance score when using Osso VR Training?	3.5	2	5
	What was the time and motion performance score when using standard training?	2.5	1	2
	What was the total performance score when using Osso VR training?	17.5	19	20

	Mean Absolute Error (MAE)		1.20	1.40
	Mean Absolute Percentage Error (MAPE) %		28.62	24.43
simple line on bar
	What was the number of mocap systems (in thousands) in the market in 2024?	75	87.5	80
	What was the number of mocap systems (in thousands) in the market in 2022?	15.0	20	20
	What was the number of assets created by users (in million) per year in 2022?	22.5	0.02	22.5
	What was the number of assets created by users (in million) per year in 2023?	44	40	45
	What was the highest number of assets created by users (in million) between 2021 and 2024?	80	67.5	80

	Mean Absolute Error (MAE)		11.30	2.20
	Mean Absolute Percentage Error (MAPE) %		34.93	8.45
range line chart
	What percentage of respondents prefer in-store experience during the purchase step of shopping journey for the furniture category?	78	39	80
	What percentage of respondents prefer in-store experience during the discovery step of shopping journey for the computer and electronics category?	30	57	30
	What percentage of respondents prefer in-store experience during the pickup step of shopping journey for the furniture category?	35	38	30
	What is the average in-store preference (percentage of respondents) during the trial step of the shopping journey?	80	50	60
	What is the lowest in-store preference % in the pickup step of the shopping journey?	35	36	30
	What is the highest in-store shopping preference % in the trial step of the shopping journey?	85	80	80

	Mean Absolute Error (MAE)		17.50	6.17
	Mean Absolute Percentage Error (MAPE) %		32.47	10.34

References

[1] B. Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs, arXiv, 2023. https://arxiv.org/pdf/2305.18641.
[2] D. Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis, arXiv, 2023. https://arxiv.org/pdf/2308.11224.
[3] E. Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study, arXiv, 2023. https://arxiv.org/pdf/2305.13062.
[4] F. Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models - A Survey, arXiv, 2024. https://arxiv.org/pdf/2404.01869.
[5] Y. UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning, arXiv, 2023. https://arxiv.org/pdf/2305.14761.
[6] X. Zhang, Y. Xie, H. Zhang, et al. A Survey on Evaluation of Large Language Models, arXiv, 2023. https://arxiv.org/pdf/2307.03109.