Checklist: Node.JS production best practices

Last updated: 09/09/2017 

Welcome to my comprehensive collection of tips on running Node.JS in production. It aims to summarize most of the knowledge gathered to date from the highest ranked blog posts.


Written by Yoni Goldberg – An independent Node.JS developer and consultant


1. Monitoring!

TL;DR: Monitoring is a game of finding issues before your customers do, so it obviously deserves top priority. The market is flooded with offerings, so start by defining the basic metrics you must track, then go over the additional fancy features and choose the solution that ticks all the boxes

Otherwise: Failure === disappointed customers. Simple.
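
As a starting point for "the basic metrics you must track", here is a minimal sketch that samples a few process-level numbers using only Node's built-in APIs. The choice of metrics and the function names (collectBasicMetrics, measureEventLoopLag) are my own illustration, not the list from the original post; a real setup would ship these values to whichever monitoring product you pick.

```javascript
// Hypothetical helper: the selection of metrics is illustrative only.
function collectBasicMetrics() {
  const memory = process.memoryUsage();
  const cpu = process.cpuUsage();
  return {
    uptimeSeconds: process.uptime(),
    rssBytes: memory.rss,           // total memory held by the process
    heapUsedBytes: memory.heapUsed, // memory used by JavaScript objects
    cpuUserMicros: cpu.user,
    cpuSystemMicros: cpu.system,
  };
}

// Measures how late a timer fires. A busy event loop delays timers,
// so the extra delay is a rough indicator of event loop lag.
function measureEventLoopLag(intervalMs, onLag) {
  const scheduledAt = process.hrtime.bigint();
  setTimeout(() => {
    const elapsedMs = Number(process.hrtime.bigint() - scheduledAt) / 1e6;
    onLag(Math.max(0, elapsedMs - intervalMs));
  }, intervalMs);
}

// Usage: print one sample, then report the lag measured over 100 ms.
console.log(collectBasicMetrics());
measureEventLoopLag(100, (lagMs) => console.log(`event loop lag: ${lagMs.toFixed(1)} ms`));
```

Sampling like this only tells you what a single process sees; alerting and dashboards still belong in a dedicated monitoring system.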



2. Increase transparency using smart logging

TL;DR: Logs can be a dumb warehouse of debug statements or the enabler of a beautiful dashboard that tells the story of your app. Plan your logging platform from day 1: decide how logs are collected, stored and analyzed, to ensure that the desired information (e.g. error rate, following an entire transaction through services and servers, etc.) can really be extracted

Otherwise: You end up with a black box that is hard to reason about, and then you start rewriting all your logging statements to add the missing information
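
One way to make logs extractable later is to emit every entry as a single JSON line with consistent fields. The sketch below assumes nothing beyond plain Node; the helper names (formatLogEntry, logInfo, logError) and the field names are made up for illustration, and in practice a logging library such as winston, bunyan or pino would play this role.

```javascript
// Hypothetical helper: builds one JSON log line with fields a log platform can index.
// Caller-supplied context comes first so it cannot overwrite the reserved fields.
function formatLogEntry(level, message, context = {}) {
  return JSON.stringify({
    ...context,
    timestamp: new Date().toISOString(),
    level,
    message,
  });
}

function logInfo(message, context) {
  console.log(formatLogEntry('info', message, context));
}

function logError(message, error, context = {}) {
  console.error(formatLogEntry('error', message, {
    ...context,
    errorName: error.name,
    errorMessage: error.message,
    stack: error.stack,
  }));
}

// Usage: the extra fields make it possible to compute an error rate per route later.
logInfo('order created', { route: '/orders', orderId: 42 });
logError('payment failed', new Error('card declined'), { route: '/orders', orderId: 42 });
```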



3. Delegate anything possible (e.g. gzip, SSL) to a reverse proxy

TL;DR: Node is awfully bad at CPU-intensive tasks like gzipping and SSL termination. Instead, use a ‘real’ middleware service like nginx, HAProxy or a cloud vendor service

Otherwise: Your poor single thread will be kept busy with networking tasks instead of dealing with your application core, and performance will degrade accordingly



4. Lock dependencies

TL;DR: Your code must be identical across all environments, but amazingly NPM lets dependencies drift across environments by default: package.json stores a version range (e.g. ^1.2.3) rather than an exact version, so installing in different environments may fetch a newer release than the one you tested. Overcome this by using an NPM config file, .npmrc, that tells each environment to save the exact (not the latest) version of each package. Alternatively, for finer-grained control use NPM “shrinkwrap”. *Update: as of NPM 5, dependencies are locked by default. The new package manager in town, Yarn, also has us covered by default

Otherwise: QA will thoroughly test the code and approve a version that will behave differently in production. Even worse, different servers in the same production cluster might run different code
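
To see how much drift a project is exposed to, a small script can list the dependencies that are declared as ranges rather than exact versions. This is only a sketch: findUnpinnedDependencies is a hypothetical helper, the package.json content is made up, and the check treats anything other than a plain x.y.z (optionally with a prerelease tag) as unpinned. The .npmrc setting that makes NPM save exact versions is save-exact=true.

```javascript
// Returns "name@range" for every dependency that is not pinned to an exact version.
function findUnpinnedDependencies(packageJson) {
  const exactVersion = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$/;
  const unpinned = [];
  for (const section of ['dependencies', 'devDependencies']) {
    const deps = packageJson[section] || {};
    for (const [name, range] of Object.entries(deps)) {
      if (!exactVersion.test(range)) unpinned.push(`${name}@${range}`);
    }
  }
  return unpinned;
}

// Example with a made-up package.json
const examplePackage = {
  dependencies: { express: '^4.15.4', lodash: '4.17.4' },
  devDependencies: { mocha: '~3.5.0' },
};
console.log(findUnpinnedDependencies(examplePackage));
// prints [ 'express@^4.15.4', 'mocha@~3.5.0' ]
```

Pinning direct dependencies does not pin their own dependencies, which is why a lock file (shrinkwrap, package-lock.json or yarn.lock) is still the more complete answer.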



5. Guard process uptime using the right tool

TL;DR: The process must go on and get restarted upon failures. For simple scenarios, ‘restarter’ tools like PM2 might be enough, but in today’s ‘dockerized’ world cluster management tools should be considered as well

Otherwise: Running dozens of instances without a clear strategy, and with too many tools together (cluster management, Docker, PM2), might lead to DevOps chaos



6. Ensure error management best practices are met

TL;DR: Error management must be the most time-consuming and painful task in keeping Node.JS environments stable. This is mostly due to the ‘one thread’ model and the lack of a proper strategy for error paths in asynchronous flows. There are no shortcuts here: you must fully understand and tame the error management beast. My list of error handling best practices might get you there quicker

Otherwise: Crazy stuff will happen: the process crashes only because a user passed in invalid JSON, errors disappear without a trace, and stack-trace information is revealed to the end user
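
The two failure modes named above (a crash on invalid JSON and a leaked stack trace) can be illustrated with a short sketch. The AppError class and both helper functions are hypothetical names chosen for this example; they are not taken from the error handling list the post refers to.

```javascript
// Hypothetical error type marking failures that are expected and safe to report.
class AppError extends Error {
  constructor(message, httpStatus, isOperational = true) {
    super(message);
    this.name = 'AppError';
    this.httpStatus = httpStatus;
    this.isOperational = isOperational;
  }
}

// Invalid JSON from a user becomes a 400 error instead of an uncaught exception.
function parseJsonBody(rawBody) {
  try {
    return JSON.parse(rawBody);
  } catch (error) {
    throw new AppError('Request body is not valid JSON', 400);
  }
}

// Maps any error to a response without leaking the stack trace to the end user.
function toClientResponse(error) {
  if (error instanceof AppError && error.isOperational) {
    return { status: error.httpStatus, body: { error: error.message } };
  }
  return { status: 500, body: { error: 'Internal server error' } };
}

// Usage: a malformed body is reported to the caller, the process keeps running.
try {
  parseJsonBody('{"name": ');
} catch (error) {
  console.log(toClientResponse(error));
}
```

Unknown errors still deserve a full log entry (including the stack) on the server side; only the client-facing response is trimmed.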



7. Utilize all CPU cores

TL;DR: In its basic form, a Node app runs on a single CPU core while all the others are left idle. It’s your duty to replicate the Node process and utilize all CPUs. For small-to-medium apps you may use Node Cluster or PM2. For a larger app, consider replicating the process using a Docker cluster (e.g. K8S, ECS) or deployment scripts that are based on a Linux init system (e.g. systemd)

Otherwise: Your app will likely utilize only 25% of its available resources(!) or even less. Note that a typical server has 4 CPU cores or more, while a naive deployment of Node.JS utilizes only 1 (even when using PaaS services like AWS Beanstalk!)
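
For the Node Cluster option, a minimal sketch could look like the following. The function names (startPrimary, startWorker) and the port are assumptions for illustration, and the entry point is left commented out so that loading the file does not fork anything by accident.

```javascript
const cluster = require('cluster');
const http = require('http');
const os = require('os');

// Forks one worker per CPU core and replaces any worker that dies.
function startPrimary(workerCount = os.cpus().length) {
  for (let i = 0; i < workerCount; i += 1) cluster.fork();
  cluster.on('exit', (worker, code, signal) => {
    console.error(`worker ${worker.process.pid} exited (${signal || code}), forking a replacement`);
    cluster.fork();
  });
}

// Each worker runs its own HTTP server; the cluster module lets them share one port.
function startWorker(port) {
  http.createServer((req, res) => {
    res.end(`handled by pid ${process.pid}\n`);
  }).listen(port);
}

// Entry point (cluster.isMaster is named cluster.isPrimary on newer Node versions):
// if (cluster.isMaster) startPrimary(); else startWorker(3000);
```

Note that this sketch restarts workers unconditionally; a worker that crashes on startup would be re-forked in a tight loop, which is one reason tools like PM2 or a cluster manager add back-off and health checks.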



8. Create a ‘maintenance endpoint’

TL;DR: Expose a set of system-related information, like memory usage and a REPL, in a secured API. Although it’s highly recommended to rely on standard, battle-tested tools, some valuable information and operations are easier to get using code

Otherwise: You’ll find that you’re performing many “diagnostic deploys” – shipping code to production only to extract some information for diagnostic purposes
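
A bare-bones version of such an endpoint, using only Node's built-in http and crypto modules, might look like this sketch. The URL path, the x-maintenance-token header and the MAINTENANCE_TOKEN environment variable are all assumptions made for the example; a real deployment would sit behind proper authentication and would not be exposed publicly.

```javascript
const http = require('http');
const crypto = require('crypto');

// Hypothetical snapshot of system-related information.
function buildDiagnostics() {
  return {
    pid: process.pid,
    nodeVersion: process.version,
    uptimeSeconds: Math.round(process.uptime()),
    memory: process.memoryUsage(),
  };
}

// Compares the tokens in constant time; an empty expected token never matches.
function isAuthorized(providedToken, expectedToken) {
  const provided = Buffer.from(String(providedToken || ''));
  const expected = Buffer.from(String(expectedToken || ''));
  return expected.length > 0 &&
    provided.length === expected.length &&
    crypto.timingSafeEqual(provided, expected);
}

function createMaintenanceServer(expectedToken) {
  return http.createServer((req, res) => {
    if (req.url !== '/maintenance/diagnostics') {
      res.writeHead(404);
      return res.end();
    }
    if (!isAuthorized(req.headers['x-maintenance-token'], expectedToken)) {
      res.writeHead(401);
      return res.end();
    }
    res.writeHead(200, { 'Content-Type': 'application/json' });
    return res.end(JSON.stringify(buildDiagnostics()));
  });
}

// Usage (assumed env var name), bound to localhost only:
// createMaintenanceServer(process.env.MAINTENANCE_TOKEN).listen(8081, '127.0.0.1');
```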



9. Discover errors and downtime using APM products

TL;DR: Monitoring and performance products (a.k.a. APM) proactively gauge your codebase and API so they can auto-magically go beyond traditional monitoring and measure the overall user experience across services and tiers. For example, some APM products can highlight a transaction that loads too slowly on the end user’s side while suggesting the root cause

Otherwise: You might spend great effort on measuring API performance and downtime, yet probably never learn which parts of your code are slowest under real-world scenarios and how they affect the UX



10. Make your code production-ready

TL;DR: Code with the end in mind: plan for production from day 1. This sounds a bit vague, so I’ve compiled a few development tips that are closely related to production maintenance

Otherwise: A world champion IT/devops guy won’t save a system that is badly written



11. Tick the obvious security boxes

TL;DR: Node embodies some unique security challenges; in this bullet I’ve grouped the straightforward security measures. It goes without saying that a “secured” system requires a much more extensive security analysis

Otherwise: What is worse than a security leak that is covered in the press? A no-brainer security issue that you just forgot to address




12. Measure and guard the memory usage

TL;DR: Node.js has a controversial relationship with memory: the v8 engine has soft limits on memory usage (1.4GB) and there are known paths to leak memory in Node’s code, so watching the Node process memory is a must. In small apps you may gauge memory periodically using shell commands, but in medium-to-large apps consider baking your memory watch into a robust monitoring system

Otherwise: Your process memory might leak a hundred megabytes a day, as happened at Walmart
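
Inside the process, the simplest form of a memory watch is a periodic check of process.memoryUsage() against a limit. The sketch below is an assumption-laden illustration: checkMemory and watchMemory are hypothetical helpers, and the 1 GB limit in the usage line is an arbitrary value, not a recommendation.

```javascript
// Compares the resident set size against a limit; `usage` can be injected for testing.
function checkMemory(limitBytes, usage = process.memoryUsage()) {
  return {
    rssBytes: usage.rss,
    heapUsedBytes: usage.heapUsed,
    overLimit: usage.rss > limitBytes,
  };
}

// Samples memory every `intervalMs` and calls `onOverLimit` when the limit is crossed.
function watchMemory(limitBytes, intervalMs, onOverLimit) {
  const timer = setInterval(() => {
    const sample = checkMemory(limitBytes);
    if (sample.overLimit) onOverLimit(sample);
  }, intervalMs);
  timer.unref(); // the watcher alone should not keep the process alive
  return timer;
}

// Usage: warn when the process holds more than 1 GB (arbitrary example limit).
watchMemory(1024 * 1024 * 1024, 60 * 1000, (sample) => {
  console.warn(`memory over limit: rss=${sample.rssBytes} bytes`);
});
```

A steadily growing rss across many samples is usually a more useful leak signal than a single crossing of a fixed limit, which is where a monitoring system with history helps.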



13. Get your frontend assets out of Node

TL;DR: Serve frontend content using dedicated middleware (nginx, S3, CDN), because Node performance really gets hurt when dealing with many static files due to its single-threaded model

Otherwise: Your single Node thread will be kept busy streaming hundreds of html/images/angular/react files instead of allocating all its resources to the task it was born for: serving dynamic content



14. Be stateless, kill your Servers almost every day

TL;DR: Store any type of data (e.g. user sessions, cache, uploaded files) within external data stores. Consider ‘killing’ your servers periodically, or use a ‘serverless’ platform (e.g. AWS Lambda) that explicitly enforces stateless behavior

Otherwise: Failure at a given server will result in application downtime instead of just killing a faulty machine. Moreover, scaling out elastically will get more challenging due to the reliance on a specific server



15. Use tools that automatically detect vulnerabilities

TL;DR: Even the most reputable dependencies, such as Express, have known vulnerabilities from time to time that put a system at risk. This can be easily tamed using community and commercial tools that constantly check for vulnerabilities and warn (locally or at GitHub); some can even patch them immediately

Otherwise: Keeping your code clean of vulnerabilities without dedicated tools requires constantly following online publications about new threats. Quite tedious



16. Assign ‘TransactionId’ to each log statement

TL;DR: Assign the same identifier, transaction-id: {some value}, to each log entry within a single request. Then, when inspecting errors in logs, you can easily conclude what happened before and after. Unfortunately, this is not easy to achieve in Node due to its async nature

Otherwise: Looking at a production error log without the context – what happened before – makes it much harder and slower to reason about the issue
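
Newer Node versions (12.17 and later) ship AsyncLocalStorage in the built-in async_hooks module, which makes this considerably easier than it was when this checklist was written. The following is a sketch under that assumption, not the approach from the original code examples; withTransaction, buildLogEntry and the 'tx-123' id are names invented for the illustration.

```javascript
const { AsyncLocalStorage } = require('async_hooks');
const crypto = require('crypto');

const requestContext = new AsyncLocalStorage();

// Runs `handler` so that everything it calls, sync or async, sees the same transaction id.
function withTransaction(handler, transactionId = crypto.randomBytes(8).toString('hex')) {
  return requestContext.run({ transactionId }, handler);
}

// Builds a log entry that carries the current transaction id, or null outside a transaction.
function buildLogEntry(message) {
  const store = requestContext.getStore();
  return { transactionId: store ? store.transactionId : null, message };
}

function log(message) {
  console.log(JSON.stringify(buildLogEntry(message)));
}

// Example: both entries share the id although the second runs in a later event loop turn.
withTransaction(() => {
  log('order received');
  setImmediate(() => log('order saved'));
}, 'tx-123');
```

In a web server, the call to withTransaction would typically wrap each incoming request, so that no function along the way needs the id passed in explicitly.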



17. Set NODE_ENV=production

TL;DR: Set the environment variable NODE_ENV to ‘production’ or ‘development’ to flag whether production optimizations should be activated. Many NPM packages check the current environment and optimize their code for production

Otherwise: Omitting this simple property might greatly degrade performance. For example, when using Express for server-side rendering, omitting NODE_ENV makes rendering slower by a factor of three!
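
Application code can follow the same convention as those packages. Here is a small sketch of reading the flag once and deriving settings from it; the setting names (cacheTemplates, verboseErrors, logLevel) are made up for the example.

```javascript
// Anything other than the exact string 'production' is treated as non-production.
function isProduction(env = process.env) {
  return env.NODE_ENV === 'production';
}

// Hypothetical settings derived from the environment flag.
function buildSettings(env = process.env) {
  const production = isProduction(env);
  return {
    cacheTemplates: production,
    verboseErrors: !production,
    logLevel: production ? 'info' : 'debug',
  };
}

// Usage: start the app with NODE_ENV=production node app.js to get the production settings.
console.log(buildSettings());
```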



18. Design automated, atomic and zero-downtime deployments

TL;DR: Research shows that teams who perform many deployments lower the probability of severe production issues. Fast and automated deployments that don’t require risky manual steps or service downtime significantly improve the deployment process. You should probably achieve that using Docker combined with CI tools, as they have become the industry standard for streamlined deployment

Otherwise: Long deployments -> production downtime & human-related errors -> team lacks confidence in making deployments -> fewer deployments and features

Generic topic, read further information on the web. This topic is not related directly to Node.JS.


19. Bump your NPM version in each deployment

TL;DR: Any time a new version is released, increase the package.json version attribute so that it is clear in production which version is deployed. All the more so in a microservice environment, where different servers might hold different versions. The command “npm version” can do that for you automatically

Otherwise: Developers frequently hunt a production bug within a distributed system (i.e. multiple versions of multiple services) only to realize that the presumed version is not deployed where they are looking
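
Bumping the version only helps if the running service reports it. As a sketch, the helper below (describeDeployment, a hypothetical name) turns a parsed package.json into a record that can be logged at startup or returned from a diagnostics endpoint; the service name and version in the example are made up.

```javascript
// Describes what is deployed, based on a parsed package.json and the environment.
function describeDeployment(packageJson, env = process.env) {
  return {
    name: packageJson.name,
    version: packageJson.version,
    nodeEnv: env.NODE_ENV || 'development',
    nodeVersion: process.version,
  };
}

// Example with made-up values; a real service would pass require('./package.json').
console.log(describeDeployment({ name: 'orders-service', version: '1.4.2' }, { NODE_ENV: 'production' }));
```

Running “npm version patch” (or minor/major) as part of the release step updates the version attribute that this record reads.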


20. Stay tuned, more are coming soon

TL;DR: I’ll soon write here about other production best practices like post-mortem debugging, tuning the libuv thread pool, creating production smoke tests and more. Want to stay updated? Follow my Twitter or Facebook pages

  • Icaro Tavares

    In chapter 10 GIST have some writing issues like “memory” and “callbabk”.
    Reading this i can say: NICE article, a lot of knowledge for all nodejs developers in one website.
    Thanks for sharing this to nodejs devs.

    • Yoni Goldberg

      @icaro_tavares:disqus really glad to hear those nice words. Thanks for pointing out the typos (I ran the entire article through a spell-checker, might have missed chapter 10) – will fix very soon

  • German Torvert

    Very cool article! It touches many aspects that you bypass, forget or didn’t know of, while stuck in a daily routine.
    Refreshing. Nicely written. Thank you!

    • Yoni Goldberg

      Thank you German! These kind words motivate me to start writing the next part

  • Great article! However points 17, 18, 19, and 20 all link to the same pop-up about setting NODE_ENV to production.

    Looking forward to putting these tips into action. Thanks!

    • Yoni Goldberg

      @disqus_PVzVkew1KD:disqus good catch! fixed that issue. Glad you liked it!

  • Andrzej

    Great article and very useful. Thanks so much. Keep going.

    • Yoni Goldberg

      @disqus_BoXLhymvsS:disqus Thanks a lot. Which type of Best Practices would you like to see next:

      Project Setup or Testing or Deployment or Data Access or API?

      • Great article! I can now remove several bookmarks about NodeJS))
        It possible to look a bit about Project Setup?

        P.S.: cool UX on you website))

        • Yoni Goldberg

          @mishadatsko:disqus thanks man, happy to hear those words. I also tend to opt for the ‘project setup’ option. What is your main concern: structure? dev tools? CI? packages?

          • Cool) I think the main concern will be structure and packages. Also some info about packages security and possible structure changes on scaling at maintenance period.

          • Yoni Goldberg

            @mishadatsko:disqus do you mean which specific packages to use for various tasks like web framework, logging, validation, authentication, data access, linting, etc?

          • Yes, like: “this package is very useful because of…”. I think this will be very helpful)

          • Yoni Goldberg

            sounds like a challenge I’d love to pick, thanks for the tips 🙂

          • glad you like it))

  • Leonardo Rodriguez

    Excellent article friend. For the #18 I suggest and I use Bitbucket Pipelines (with docker below) for 10 monthly dollars they give you 500min build time, setting up the environment is very easy (DockerFile) here is documentation.

    • Yoni Goldberg

      @disqus_6UNlE7UTTd:disqus thank you. Any specific reasons to prefer BitBucket CI features over Jenkins/Circle/Travis/etc?

  • dman777

    I don’t recommend docker on production systems. All containers share the same kernel on the host, so any kernel panic in one container can affect the others.

    • Yoni Goldberg

      @dman777:disqus interesting argument, any chance you can share a link that distills why a kernel panic would affect a few containers more than ‘casual’ Node processes?

  • John Best

    This was an excellent article. Really appreciate your insights.

    • Yoni Goldberg

      @disqus_7BCYqSXKtQ:disqus really happy to hear this, motivates me toward my next post 🙂

  • cztomsik

    Node will use up to 100% of your CPU even without docker/pm2 – it depends on the amount of I/O, which is done by libuv, a multi-threaded library written in C. Node has intentionally chosen javascript because of the event loop, which allows such a thing (your code always runs single-threaded but I/O runs in parallel)

    • Yoni Goldberg

      @cztomsik:disqus you have a strong point here. While writing this, I considered a case where the ‘user land’, v8 single thread code, is the bottleneck – in that case using more processes and cores would unleash the performance. However, when I think about it now, I can hardly imagine a situation where the load won’t be on the IO side (unless a developer performs heavy CPU tasks, probably a rare situation), so it seems like I have to fix this and point out that process replication should be done mostly for redundancy purposes.

  • Nice article – for those interested more in monitoring, see “top nodejs metrics to watch”:
    I would add alerting on logs and metrics as mandatory for production. We push alerts in a slack channel, so we get notified on mobile when our process throws some “uncaught errors” or when node event loop or http responses show high delays.
    We developed a log shipper including a smart log parser in nodejs – of course it got a plugin to collect nodejs metrics to monitor the log shipper 🙂 – The logging framework (winston, bunyan, pino, …) might have an impact on performance as well. Logging consumes CPU cycles and reduces therefore e.g. web server throughput. Pino is a logging framework tuned to be fast (and bunyan compatible) This article shows performance impact in a few production scenarios:
    On Docker we use Sematext Docker Agent, which gets us metrics, logs and docker events. For nodejs specific metrics we run spm-agent-nodejs embedded in our node based web services. We don’t deploy any solution without having setup done for logs & metrics collection and we eat our own dog food 🙂

    • Yoni Goldberg

      Stefan, great tips here. Thank you

      May I ask:
      – “Slack notification when our process throws some “uncaught errors” or when node event loop or http responses show high delays”
      What is the technical flow here: how do you detect those events (uncaught, event loop delay) and stream them to Slack? There are multiple options here, I wonder which one you use

      I’ll take a look at Pino

      • We use Sematext Cloud, an all in one logging & monitoring solution (full disclosure: I work for Sematext). Notifications of anomaly detection alerts from any metric or log query can be pushed to BigPanda, VictorOps, PagerDuty, Slack or any other Webhook, or just forwarded via e-mail.
        I think the methodology to watch key metrics and push alerts to incident management platforms or real-time collaboration tools is a general one and not bound to a specific product. There are open source alternatives like Prometheus exporters for Node.js and creating Grafana dashboards, but then you need to invest more time in the setup and own the monitoring & logging infrastructure (such as a Prometheus server, setting up Elasticsearch, Filebeat, Kibana and Elastalert, and maintaining growing storage for logs and time series databases for metrics). You could also choose other commercial monitoring and logging products like Datadog/NewRelic or Loggly/Papertrail/Splunk, but then you don’t get metrics and logs in one platform, which costs a bit more time when troubleshooting, as you need to switch between monitoring and logging user interfaces to get all the information on your screen.

© 2017 Yoni Goldberg. All rights reserved.