We introduce TableLLM, a robust large language model (LLM) with 13 billion parameters, purpose-built for handling tabular data manipulation tasks on tables embedded in documents or spreadsheets, as encountered in real-world office scenarios. For training, we propose a distant supervision method that combines a reasoning-process extension strategy, which helps the model learn reasoning patterns more effectively, with a cross-way validation strategy, which ensures the quality of the self-generated training data. To evaluate TableLLM, we construct a benchmark covering both document and spreadsheet formats, together with an evaluation pipeline capable of handling both scenarios. Thorough evaluations show the advantages of TableLLM over a range of existing general-purpose and tabular-data-focused LLMs.
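The paragraph above does not spell out the mechanics of cross-way validation, so the snippet below is only a minimal illustrative sketch of one plausible reading: a self-generated (table, question, code solution, text answer) example is kept only when the answer obtained by executing the code solution agrees with the independently generated text answer. The function name `cross_way_validate`, the convention that the code solution assigns to `result`, and the exact-match comparison are assumptions made for illustration, not details of the released training pipeline.

```python
import pandas as pd

def cross_way_validate(table: pd.DataFrame, code_solution: str, text_answer: str) -> bool:
    """Illustrative sketch (assumed interpretation): keep a self-generated example
    only if executing its code solution reproduces its text answer."""
    scope = {"df": table.copy()}
    try:
        # Assumption: the generated code solution stores its answer in `result`.
        exec(code_solution, {"pd": pd}, scope)
    except Exception:
        return False  # discard examples whose code does not run
    return str(scope.get("result", "")).strip() == text_answer.strip()

# Hypothetical usage on a toy table.
df = pd.DataFrame({"city": ["Beijing", "Shanghai"], "population": [2154, 2487]})
code = "result = df.loc[df['population'].idxmax(), 'city']"
print(cross_way_validate(df, code, "Shanghai"))  # True
```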
We evaluate the code-solution generation ability of TableLLM on three benchmarks: WikiSQL, Spider, and our self-created table operation benchmark. Its text-answer generation ability is tested on four benchmarks: WikiTableQuestions (WikiTQ), TAT-QA, FeTaQA, and OTTQA. The evaluation results are shown in the table below; a sketch of how the two answer formats are scored follows the table.
Model | WikiTQ | TAT-QA | FeTaQA | OTTQA | WikiSQL | Spider | Self-created | Average |
---|---|---|---|---|---|---|---|---|
TaPEX | 38.5 | – | – | – | 83.9 | 15.0 | / | 45.8 |
TaPas | 31.5 | – | – | – | 74.2 | 23.1 | / | 42.9 |
TableLlama | 24.0 | 22.2 | 20.5 | 6.4 | 43.7 | 9.0 | / | 20.7 |
GPT3.5 | 58.5 | 72.1 | 71.2 | 60.8 | 81.7 | 67.4 | 77.1 | 69.8 |
GPT4 | 74.1 | 77.1 | 78.4 | 69.5 | 84.0 | 69.5 | 77.8 | 75.8 |
Llama2-Chat (13B) | 48.8 | 49.6 | 67.7 | 61.5 | – | – | – | 56.9 |
CodeLlama (13B) | 43.4 | 47.2 | 57.2 | 49.7 | 38.3 | 21.9 | 47.6 | 43.6 |
Deepseek-Coder (33B) | 6.5 | 11.0 | 7.1 | 7.4 | 72.5 | 58.4 | 73.9 | 33.8 |
StructGPT (GPT3.5) | 52.5 | 27.5 | 11.8 | 14.0 | 67.8 | 84.8 | / | 48.9 |
Binder (GPT3.5) | 61.6 | 12.8 | 6.8 | 5.1 | 78.6 | 52.6 | / | 42.5 |
DATER (GPT3.5) | 53.4 | 28.4 | 18.3 | 13.0 | 58.2 | 26.5 | / | 37.0 |
TableLLM-7B (Ours) | 58.8 | 66.9 | 72.6 | 63.1 | 86.6 | 82.6 | 78.8 | 72.8 |
TableLLM-13B (Ours) | 62.4 | 68.2 | 74.5 | 62.5 | 90.7 | 83.4 | 80.8 | 74.7 |
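Code-solution benchmarks and text-answer benchmarks are scored differently in the evaluation pipeline. The sketch below shows one way this two-branch scoring can be approximated: code-solution benchmarks (WikiSQL, Spider, the table operation benchmark) are scored by executing the generated query against the table, while text-answer benchmarks (WikiTQ, TAT-QA, FeTaQA, OTTQA) are scored by comparing normalized strings. The helper names (`execute_sql`, `score`), the in-memory SQLite setup, and the lower-cased exact match are illustrative assumptions; the released pipeline may normalize answers and aggregate metrics differently.

```python
import sqlite3
import pandas as pd

def execute_sql(table: pd.DataFrame, sql: str, table_name: str = "t") -> str:
    """Run a generated SQL query against an in-memory SQLite copy of the table
    and flatten a single-cell result into a string for comparison."""
    conn = sqlite3.connect(":memory:")
    try:
        table.to_sql(table_name, conn, index=False)
        rows = conn.execute(sql).fetchall()
    finally:
        conn.close()
    return str(rows[0][0]) if rows and len(rows[0]) == 1 else str(rows)

def score(prediction: str, gold: str, table: pd.DataFrame, answer_type: str) -> bool:
    """Illustrative scorer: 'code' predictions (WikiSQL/Spider-style SQL) are executed,
    'text' predictions (WikiTQ/TAT-QA/FeTaQA/OTTQA-style answers) are string-matched."""
    if answer_type == "code":
        try:
            return execute_sql(table, prediction).strip().lower() == gold.strip().lower()
        except Exception:
            return False
    return prediction.strip().lower() == gold.strip().lower()

# Hypothetical usage on a toy table.
df = pd.DataFrame({"team": ["A", "B"], "wins": [3, 5]})
print(score("SELECT team FROM t WHERE wins = 5", "B", df, "code"))  # True
print(score("Team B", "team b", df, "text"))                        # True
```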
If you have any questions, please open a GitHub issue or get in touch with us at zhang2718@ruc.edu.cn, zeyaoma@ruc.edu.cn, or zhang-jing@ruc.edu.cn.