Hi,
It looks like there are some stereo issues that are causing the VO to drift more.
1) The stereo rectification doesn't look perfect; there is a 1-pixel vertical shift between the left and right cameras:

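If you want to quantify the rectification error, a quick check is to match features between a rectified left/right pair and look at the vertical offset of the matches. This is just a minimal OpenCV sketch, not something specific to your pipeline; the file names are placeholders for one of your rectified pairs.

```python
import cv2
import numpy as np

# Placeholder file names: replace with one rectified left/right pair from your setup.
left = cv2.imread('left_rect.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right_rect.png', cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(1000)
kp_l, des_l = orb.detectAndCompute(left, None)
kp_r, des_r = orb.detectAndCompute(right, None)

# Brute-force Hamming matching with cross-check to keep only mutual best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des_l, des_r)

# On a well-rectified pair, matched keypoints should lie on the same image row,
# so the y-difference of the matches estimates the remaining vertical shift.
dy = np.array([kp_l[m.queryIdx].pt[1] - kp_r[m.trainIdx].pt[1] for m in matches])
print('median vertical shift: %.2f px' % np.median(dy))
print('mean abs vertical shift: %.2f px' % np.mean(np.abs(dy)))
```

The median is used on purpose so a few bad matches don't skew the estimate; if it stays around 1 pixel, redoing the stereo calibration is worth it.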
2) Bad time sync between the left and right cameras is causing very large covariance on some links, like this one:

Looking at the images, we clearly see that the left image is synced with the wrong right image:

On the right I've shown the resulting point clouds for two consecutive frames. The red one is generated from the top image pair, where the disparity is much larger than in the pair below even though the viewpoint is similar. This issue seems to be worst when the robot is rotating at the end of each row. I would try to get a good map before trying to navigate in it.
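To confirm the sync problem independently of the point clouds, you can compare the left and right image timestamps directly from your recording. Below is a minimal sketch assuming a ROS1 bag and the Python rosbag API; the bag name, topic names, and the 16 ms threshold (roughly half a 30 Hz frame period) are assumptions to adapt to your setup.

```python
import rosbag

# Hypothetical bag file and topic names; adjust to your camera driver's topics.
BAG = 'stereo_recording.bag'
LEFT_TOPIC = '/stereo/left/image_raw'
RIGHT_TOPIC = '/stereo/right/image_raw'

left_stamps, right_stamps = [], []
bag = rosbag.Bag(BAG)
for topic, msg, _ in bag.read_messages(topics=[LEFT_TOPIC, RIGHT_TOPIC]):
    (left_stamps if topic == LEFT_TOPIC else right_stamps).append(msg.header.stamp.to_sec())
bag.close()

# For each left frame, find the closest right frame and report the offset.
# Hardware-synced cameras should be near 0; if the offset approaches half the
# frame period, the stereo pairs can get mixed up like in the images above.
for t_l in left_stamps:
    dt = min(abs(t_l - t_r) for t_r in right_stamps)
    if dt > 0.016:
        print('left frame at %.3f s is %.1f ms away from the closest right frame' % (t_l, dt * 1000))
```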
Is there a method to improve localization using only visual information, without adding additional sensors?
Here are the steps I would take to improve VSLAM:
1) Improve stereo calibration,
2) Fix stereo sync,
3) If VO still drifts too much even after fixing 1 and 2, you may look at integrating a VIO approach instead,
4) For visual localization, using SIFT/SURF/SuperPoint could help to localize over time. I see you are outdoors; classic features like ORB/BRIEF/SIFT/SURF are quite sensitive to illumination changes and shadows, while features like SuperPoint may be more robust in those cases (see the matching sketch below to compare them on your own images).
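If you want to compare feature robustness on your own data before switching, a quick test is to match two images of the same place taken under different illumination and count the RANSAC inliers. Here is a minimal OpenCV sketch comparing ORB and SIFT; the file names are placeholders, and SuperPoint is not bundled with OpenCV so it is not included here. A detector that keeps more inliers across the lighting change should localize more reliably over time.

```python
import cv2
import numpy as np

def inlier_count(img_a, img_b, detector, norm):
    """Match features between two images and count epipolar RANSAC inliers."""
    kp_a, des_a = detector.detectAndCompute(img_a, None)
    kp_b, des_b = detector.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matches = cv2.BFMatcher(norm).knnMatch(des_a, des_b, k=2)
    # Lowe ratio test to discard ambiguous matches.
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.8 * m[1].distance]
    if len(good) < 8:
        return 0
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    _, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 3.0, 0.99)
    return 0 if mask is None else int(mask.sum())

# Placeholder file names: same place captured at different times of day.
img_morning = cv2.imread('row_end_morning.png', cv2.IMREAD_GRAYSCALE)
img_evening = cv2.imread('row_end_evening.png', cv2.IMREAD_GRAYSCALE)

print('ORB inliers :', inlier_count(img_morning, img_evening, cv2.ORB_create(2000), cv2.NORM_HAMMING))
print('SIFT inliers:', inlier_count(img_morning, img_evening, cv2.SIFT_create(2000), cv2.NORM_L2))
```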
cheers,
Mathieu