On Call Guide
On-call responsibilities and process for CiviForm.
Exygy engineers are responsible for on-call shifts, and Google.org and Civic Entity engineers can opt in to the rotation. To get added to the rotation, contact Nick Burgan on Slack.
Do these things when you initially onboard to the CiviForm on-call rotation.
Join the CiviForm GitHub repo. Ask someone who already has access to add you as an admin.
Also join the CiviForm GitHub org.
Add the CiviForm Shared Google Calendar. Contact Nick Burgan or Rocky Fine on Slack to get added to the permissions list.
Join the CiviForm public Slack org (https://civiform.slack.com).
Subscribe to email alerts for new bugs filed in the GitHub issue tracker. Click the "Watch/Unwatch" button at the top of the repository page, select "Custom", then select "Issues".
Ensure that you are receiving emails to civiform-technical@googlegroups.com and are not catching them in email filters.
Subscribe to the
Subscribe to the
Subscribe to the
Things to do before each on-call shift starts.
Check if there are any current urgent bugs. If there are, make sure you know what the state of response is (check in with the previous on-caller).
Respond to downstream production incidents (daily)
Check security mailing lists for new vulnerability reports (daily)
If there is something that looks critical, post in #eng-prod-incidents or post in #eng-general if you aren't sure.
This issue is one that Renovate creates and updates with what it is currently tracking. Check this for any rate-limited dependencies and check the box to create them.
For any problematic dependency updates that break tests, add the "needs-triage" label so Exygy can prioritize fixing these.
If you come across an issue that could use a playbook or further documentation, create a GitHub issue to track that additional documentation is needed. Assign it to yourself, or to the next on-caller if you don't have capacity.
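As a rough aid for the Renovate tracking-issue check described above, the sketch below pulls the still-unchecked entries out of the issue body. It assumes the body is Markdown with a heading containing "Rate-Limited" followed by task-list checkboxes; that layout and the helper name `rate_limited_updates` are assumptions for illustration, so confirm them against the real issue before relying on this.

```python
import re

# Assumed layout: a Markdown heading containing "Rate-Limited", followed by
# task-list items such as " - [ ] <!-- marker -->Update foo to v2".
HEADING = re.compile(r"^#{1,6}\s+(.*)$")
UNCHECKED = re.compile(r"^\s*[-*]\s+\[ \]\s+(.*)$")
HTML_COMMENT = re.compile(r"<!--.*?-->")


def rate_limited_updates(dashboard_body: str) -> list:
    """Return the titles of unchecked items under the rate-limited heading."""
    titles = []
    in_section = False
    for line in dashboard_body.splitlines():
        heading = HEADING.match(line)
        if heading:
            # Every heading either opens or closes the section of interest.
            in_section = "rate-limited" in heading.group(1).lower()
            continue
        item = UNCHECKED.match(line) if in_section else None
        if item:
            # Strip hidden HTML comment markers from the visible title.
            titles.append(HTML_COMMENT.sub("", item.group(1)).strip())
    return titles
```

Pasting the issue body into this function gives a short list of updates whose boxes still need to be checked by hand.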
The top priority for the on-caller is addressing urgent needs from downstream deployments of CiviForm. An urgent need is an outage, privacy, or security incident caused by bugs in the CiviForm application or deployment code. When an incident occurs, it may not be clear what the root cause is or whether it is ultimately the responsibility of the upstream project to resolve it. Until proven otherwise, assume that it is.
Incidents may be reported in a variety of ways. Since they come from civic entities and not from Google or Exygy internal staff or tooling, we have limited control over this. At a minimum you should monitor:
Bugs filed in the GitHub issues tracker
Emails to civiform-technical@googlegroups.com
The CiviForm Slack channel, particularly #engineering and #general
Whatever the mechanism of reporting the incident, ensure there is an issue tagged "bug" for it in GitHub issues. Throughout your investigation into the issue, ensure public visibility in the resolution by updating the issue with your progress.
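One way to keep an eye on the issue tracker between email alerts is to query the GitHub REST API for open issues carrying the "bug" label. The following is a minimal sketch using only the standard library; the function names are illustrative, the owner and repository are parameters rather than known values, and unauthenticated requests are subject to low rate limits.

```python
import json
import urllib.parse
import urllib.request

GITHUB_API = "https://api.github.com"


def bug_issues_url(owner: str, repo: str, label: str = "bug") -> str:
    """Build the REST URL that lists open issues carrying the given label."""
    query = urllib.parse.urlencode(
        {"state": "open", "labels": label, "per_page": 50}
    )
    return f"{GITHUB_API}/repos/{owner}/{repo}/issues?{query}"


def only_issues(items: list) -> list:
    """Drop pull requests; the issues endpoint returns them alongside issues."""
    return [item for item in items if "pull_request" not in item]


def fetch_open_bugs(owner: str, repo: str) -> list:
    """Fetch open bug issues without authentication (performs a network call)."""
    request = urllib.request.Request(
        bug_issues_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return only_issues(json.load(response))
```

Calling `fetch_open_bugs("example-org", "example-repo")` with the real owner and repository names substituted for those placeholders returns a list of issue dictionaries, each with `number` and `title` fields.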
Tip: Your primary responsibility with respect to incident response is to triage and ensure resolution as is appropriate. That does NOT mean you are solely responsible for implementing fixes. Delegate fixes to whoever is most able to help as necessary.
Do not merge Terraform-related dependency updates without first manually exercising the code, because we do not have automated tests for Terraform/deployment configuration. Feel free to close related PRs and file issues for performing the upgrade.
Ensure your CiviForm development environment is set up and working. Pull in the latest changes to the main branch if yours is out of date. Follow instructions for with a dev environment.
We use GitHub issues for tracking work on CiviForm. track known bugs.
on Tuesday by 12pm Pacific Time
Upgrade the version in the demo sites config files by running then merging the generated PR
Check GitHub
Check
Check the that are created to make sure there aren't any P0s (daily)
Monitor staging deployments in the Slack channel. Investigate failed deployments and re-run them if appropriate. (Note: our browser tests can be flaky and cause deployments to fail. If this is the case, re-running the deployment will often fix the issue.)
Check the (once per shift)
Check security updates at
Create an oncall issue for the next rotation using the and close the oncall issue assigned to you.
In open source software development, it's common for library maintainers to release updates when a new security vulnerability is discovered. Subscribe to the security mailing lists mentioned in the . If you receive an advisory during your on-call shift, including OpenJDK advisories, the most common response will be to update the appropriate dependency to the latest patch version that includes a fix. Once the update is available, Renovate will automatically create a PR with the update. If the vulnerability is being actively exploited and an update with a fix isn't available yet, create an issue in GitHub and triage it appropriately so that it receives immediate attention.
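To make "the latest patch version that includes a fix" concrete, the sketch below picks the newest release on the current major.minor line that is at or above the first fixed version named in an advisory. It is illustrative only: the function names are hypothetical, and it handles plain dotted numeric versions, not qualifiers such as release-candidate suffixes.

```python
from typing import List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a dotted numeric version such as "2.15.3" into (2, 15, 3)."""
    return tuple(int(part) for part in text.split("."))


def latest_fixed_patch(
    current: str, first_fixed: str, available: List[str]
) -> Optional[str]:
    """Pick the newest release on the current major.minor line that has the fix.

    Returns None when no such release exists yet, which is the case where
    a GitHub issue should be filed and triaged instead.
    """
    release_line = parse_version(current)[:2]
    floor = parse_version(first_fixed)
    candidates = [
        version
        for version in available
        if parse_version(version)[:2] == release_line
        and parse_version(version) >= floor
    ]
    return max(candidates, key=parse_version, default=None)
```

A `None` result corresponds to the situation described above where a fix is not yet available on the current release line.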
CiviForm relies on versioned dependencies managed by an open source dependency management system. These dependencies include along with a variety of other libraries that provide functionality such as view rendering, database interaction, cryptographic tools, data serialization, and more.
CiviForm's dependencies are mostly listed in the . Dependencies in here are retrieved by (CiviForm's build tool) from the , which is where you can check to see if new versions are available. Additionally, there are some dependencies managed as . These dependencies must be checked at their individual project pages for updates.
CiviForm relies on Renovate to automatically detect new versions of dependencies and create pull requests to update them. It is the on-call engineer's responsibility to review and merge these pull requests as they come in. Do not simply approve and merge every pull request Renovate creates. While passing CI checks indicates in most cases that the change is acceptable, that is not always the case, and more diligence is required. Be sure you understand what is being updated before approving. If need be, get in touch with the broader engineering team to help evaluate a given PR. For PRs that break tests, add the "needs-triage" label to the PR so Exygy can prioritize fixing these issues.
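The review rules in this guide can be summarized as a small decision helper. This is a sketch of the guidance, not project tooling: the function name and the file suffixes used to detect Terraform changes are assumptions, and a human still has to read and understand each update.

```python
from typing import List

# Hypothetical heuristic: treat these suffixes as Terraform/deployment config.
TERRAFORM_SUFFIXES = (".tf", ".tfvars", ".hcl")


def review_action(changed_files: List[str], ci_passed: bool) -> str:
    """Suggest a next step for a dependency-update PR."""
    # Terraform changes have no automated tests, so passing CI says little.
    if any(path.endswith(TERRAFORM_SUFFIXES) for path in changed_files):
        return "exercise manually before merging, or close and file an issue"
    # Updates that break tests are handed off for prioritization.
    if not ci_passed:
        return 'add the "needs-triage" label'
    # Passing CI is necessary but not sufficient.
    return "confirm you understand the update, then approve"
```

The Terraform check comes first on purpose: for those changes a green CI run is not evidence that the deployment code still works.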
We have all of our demo sites managed in . In order to upgrade the demo sites' version number in the config files, we can run the . This will create a pull request that you can approve and merge. By default, this new version will get picked up the next time the site is automatically deployed. In order to do a one-off deployment of the demo sites, the can be used. Please make sure nobody is actively using the demo site if you are planning to run the deploy action to ensure no issues.