tis-abe-akira
Description
This PR introduces two new parameters to control table detection during Markdown output:

  • min_table_rows: Minimum number of rows required for a table to be included
  • min_table_cols: Minimum number of columns required for a table to be included

Both parameters include basic validation to prevent invalid values.
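
The description does not spell out the validation rules, so the following is only a minimal sketch of what such a check could look like. It assumes both values must be integers of at least 1; the helper name `validate_table_thresholds` is hypothetical and not taken from this PR.

```python
def validate_table_thresholds(min_table_rows: int = 2, min_table_cols: int = 2) -> tuple[int, int]:
    """Return the thresholds unchanged if they are valid, otherwise raise.

    Assumption: a valid threshold is an int >= 1 (bool is rejected even
    though it is a subclass of int). The actual rules in the PR may differ.
    """
    for name, value in (("min_table_rows", min_table_rows),
                        ("min_table_cols", min_table_cols)):
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer, got {type(value).__name__}")
        if value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    return min_table_rows, min_table_cols
```

Failing early with a clear message keeps a bad threshold such as 0 or a negative number from silently changing which tables are kept.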


Motivation / Background
In the current implementation, tables with fewer than 2 rows are automatically excluded. While this works for most cases, real-world PDFs sometimes contain edge cases where a single-row table should still be recognized:

  • When the last row of a table is split onto the next page
  • When a separator line appears immediately after the header, resulting in only one detected row

In these situations, treating all single-row tables as invalid leads to loss of useful information. By making the minimum rows and columns configurable, users gain flexibility to adapt table detection to their specific document structures.


Changes

  • Added parameters min_table_rows and min_table_cols
  • Added validation for parameter values
  • Updated table filtering logic to use configurable thresholds

Notes
This change preserves the current default behavior (min_table_rows = 2, min_table_cols = 2) to avoid breaking existing usage, while allowing users to override either threshold when needed.
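
To illustrate how configurable thresholds interact with the defaults, here is a rough sketch of a filter over detected tables. It assumes a table is represented as a list of rows, each row a list of cell strings; the function name `keep_table` and this data shape are illustrative assumptions, not the code in this PR.

```python
def keep_table(rows: list[list[str]], min_table_rows: int = 2, min_table_cols: int = 2) -> bool:
    """Decide whether a detected table passes the size thresholds.

    Assumption: the column count is the width of the widest row.
    """
    row_count = len(rows)
    col_count = max((len(row) for row in rows), default=0)
    return row_count >= min_table_rows and col_count >= min_table_cols


# A single-row fragment, e.g. the last row of a table split onto the next page.
fragment = [["Total", "42", "USD"]]

dropped_by_default = not keep_table(fragment)           # defaults: 2 rows, 2 cols
kept_with_override = keep_table(fragment, min_table_rows=1)
```

With the defaults the fragment is filtered out, as before; lowering `min_table_rows` to 1 keeps it, which covers the split-row case described under Motivation.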
