Skip to content

Conversation

leonardehrenfried
Copy link
Member

@leonardehrenfried leonardehrenfried commented Jul 14, 2025

Summary

Lowers the graph build memory by packing GTFS shape points more tightly and getting rid of a lot of duplicated copies of all shape points which would take up large amounts of RAM.

The core of the memory savings are in packing coordinates tightly into an array rather than lists of shape points.

Results

One of my feeds contained 60 million (!) shape points and there I could lower the required memory from 21 to 9GB!

In Helsinki I can see a more modest decrease from 9 to 6GB of memory. However, this is still 30%.

Unit tests

Added and updated.

Documentation

Javadoc.

Bumping the serialization version id

The actual graph entities are not changed and therefore it's not necessary.

@leonardehrenfried leonardehrenfried requested a review from a team as a code owner July 14, 2025 09:29
@leonardehrenfried leonardehrenfried added GTFS Related to import of GTFS data Technical Debt labels Jul 14, 2025
Copy link

codecov bot commented Jul 14, 2025

Codecov Report

❌ Patch coverage is 86.15385% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.87%. Comparing base (a8cb876) to head (fd05541).
⚠️ Report is 184 commits behind head on dev-2.x.

Files with missing lines Patch % Lines
...ain/java/org/opentripplanner/model/ShapePoint.java 72.22% 1 Missing and 4 partials ⚠️
...aph_builder/module/geometry/GeometryProcessor.java 73.33% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             dev-2.x    #6752      +/-   ##
=============================================
+ Coverage      71.83%   71.87%   +0.03%     
  Complexity     19130    19130              
=============================================
  Files           2074     2075       +1     
  Lines          77784    77754      -30     
  Branches        7940     7939       -1     
=============================================
+ Hits           55879    55888       +9     
+ Misses         19090    19054      -36     
+ Partials        2815     2812       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@leonardehrenfried leonardehrenfried changed the title Lower graph build memory by processing shapes more efficiently Lower graph build memory by processing GTFS shapes more efficiently Jul 14, 2025
tkalvas
tkalvas previously approved these changes Jul 17, 2025
tkalvas
tkalvas previously approved these changes Jul 22, 2025
@leonardehrenfried
Copy link
Member Author

  • Sort the incoming Collection<org.onebusaway.gtfs.model.ShapePoint> on (shapeId, shapeSequence) before mapping it

On further inspection I see you're setting packShapePoints, so the OBA Collection you're reading from is a custom implementation that probably can't be sorted with Collections.sort. But then it's noteworthy that its primitive arrays are being mapped into objects by OBA and then mapped back into primitive arrays by OTP. It would be really nice if OBA just had a method to sort its ShapePointArray in place... but I guess abstractions always have a cost.

I'm trying to skip OBA altogether in a follow up.

@leonardehrenfried
Copy link
Member Author

I've now reworked the CompactShape to also work with very large gaps in the sequence numbers. It does require more memory: in my case with 60 million shape points memory consumption goes from 9GB to 11.5GB.

@abyrd
Copy link
Member

abyrd commented Jul 30, 2025

This will create lots of these intermediate collections that were the source of the high memory consumption.

Got it. So for CompactShape to use dense arrays, there would need to be some way to sort or reorder on the fly for out-of-order elements.

I considered that but Trove has been deprecated since 2017 so I'm reluctant to use it more.

That readme also says "Trove is quite stable and continues to be useable," which is my experience. It's a simple enough concept that as a library it's basically "done" and doesn't need much maintenance. If I was starting from scratch I'd probably use Fastutil but Trove still works flawlessly everywhere I've used it. I think TIntHashSet and TIntHashMap are still used in various places in OTP so it shouldn't be objectionable.

@abyrd
Copy link
Member

abyrd commented Jul 30, 2025

What do you think about always using dense arrays by storing shape point data in the arrays in the order they come in and then additionally have a (Trove) map that maps the sequence number back to the index in the array?

Yes, that seems good. I was thinking more of using a sort function that sorts one array based on another, and I thought there was one in Apache Commons but I must have been mistaken. What you did with the map looks like it would work fine, though as you said it takes more memory.

Since the ShapePoints are used once and thrown away, couldn't CompactShape just have a method that builds a temporary list of all ShapePoints for that CompactShape, sorts that list, then returns that throwaway list of throwaway objects (or an iterator for that list)? This would increase the memory consumption from one single ShapePoint at a time to all ShapePoints for one shape at a time, but that's still very small and bounded.

@abyrd
Copy link
Member

abyrd commented Jul 31, 2025

Since the ShapePoints are used once and thrown away, couldn't CompactShape just have a method that builds a temporary list of all ShapePoints for that CompactShape, sorts that list, then returns that throwaway list of throwaway objects (or an iterator for that list)?

Here's an example of what I mean:
https://github.com/opentripplanner/OpenTripPlanner/blob/abyrd/efficient-shapes/application/src/main/java/org/opentripplanner/gtfs/mapping/CompactShape.java

This is untested but I think it would have memory consumption closer to your original version. Seems quite readable/maintainable too.

@optionsome optionsome requested review from t2gran and vpaturet and removed request for tkalvas and t2gran August 5, 2025 08:21
Copy link
Contributor

@vpaturet vpaturet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, only some minor comments.

@leonardehrenfried
Copy link
Member Author

@abyrd I've taken your version and modified it a little bit: I removed the intermediate list generation and went from stream directly to iterator. It probably still materializes a collection somewhere but maybe the standard lib uses some optimization tricks.

@t2gran t2gran added this to the 2.8 (next release) milestone Aug 11, 2025
@leonardehrenfried leonardehrenfried merged commit 800aeab into opentripplanner:dev-2.x Aug 13, 2025
7 checks passed
t2gran pushed a commit that referenced this pull request Aug 13, 2025
@leonardehrenfried
Copy link
Member Author

I've merge this, @abyrd, since I believe that I satisfied your requests. If not, add your comments and I will send a follow-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GTFS Related to import of GTFS data Technical Debt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants