Monitoring and managing the impact of query performance on Osquery

Performance testing and bench-marking Osquery queries

The Osquery watchdog

The Osquery watchdog is only used in the daemon osqueryd not osqueryi.

Scheduled query may have failed: <query_name>

Osquery watchdog command line options

The Osquery watchdog can be configured via the command line with the following flags:

  • --watchdog_level=0
    The watchdog supervisor can be run in one of three modes. These modes are used to configure the performance limits:
  • Normal --watchdog_level=0 (default)
    The default performance limits are 200MB memory cap and 25% CPU usage for 9 seconds. The default mode allows for 10 restarts of the worker if the limits are violated.
  • Restrictive --watchdog_level=1
    The restrictive profile allows a 100MB memory cap and 18% CPU usage for 9 seconds. The restrictive mode allows for only 4 restarts before the service is disabled.
  • Disabled --watchdog_level=-1.
    It is better to set the watchdog to disabled. Rather than disabling the watchdog outright. As the worker/watcher concept is used for extensions too.
  • Watchdog CPU limits --watchdog_utilization_limit=0
    The utilization limit value is the maximum number of CPU cycles counted as the processes. The default is 90. meaning less 90 seconds of cpu time per 3 seconds of wall time is allowed.
  • Watchdog delay --watchdog_delay=60
    This is the delay in seconds before the watchdog process starts enforcing memory and CPU utilization limits. The default is 60 seconds. This allows the daemon to perform resource intense actions, such as forwarding logs, at startup.

Osquery profiler: profile.py

Osquery provides a really cool tool to help profile queries and test our different variants based on your Osquery config. This tool can be downloaded via the Osquery GitHub repository (python3 is required).

pip3 install psutil
git
clone git@github.com:osquery/osquery.git
{
"schedule": {
"proc1": {
"query": "SELECT * FROM processes;",
"interval": 60
},
"proc2": {
"query": "SELECT * FROM processes WHERE pid > 1000;",
"interval": 60
}
}
}
./tools/analysis/profile.py --shell `which osqueryi` --config test.confProfiling query: SELECT * FROM processes;
U:2 C:1 M:3 F:0 D:2 proc1 (1/1): utilization: 43.95 cpu_time: 0.44 memory: 26013696 fds: 4 duration: 1.0
Profiling query: SELECT * FROM processes WHERE pid > 1000;
U:2 C:1 M:3 F:0 D:2 proc2 (1/1): utilization: 43.95 cpu_time: 0.44 memory: 26013696 fds: 4 duration: 1.0
./tools/analysis/profile.py --shell `which osqueryi` --query "SELECT * FROM processes;" --rounds 3 --count 10Profiling query: SELECT * FROM processes; U:3  C:2  M:3  F:0  D:3  manual (1/3): utilization: 78.66 cpu_time: 2.37 memory: 26435584 fds: 4 duration: 3.0 
U:3 C:2 M:3 F:0 D:3 manual (2/3): utilization: 84.87 cpu_time: 3.41 memory: 26210304 fds: 4 duration: 4.0
U:3 C:2 M:3 F:0 D:2 manual (3/3): utilization: 75.25 cpu_time: 2.27 memory: 25980928 fds: 4 duration: 2.5
U:3 C:2 M:3 F:0 D:3 manual avg: utilization: 79.59 cpu_time: 2.6 memory: 26208938 fds: 4.0 duration: 3.2
Osquery profiler results.
  • CPU time C:2
    Shows the total CPU time. Containing user, system, children_user, system_user.
  • Memory M:26208938
    Shows the total memory used in bytes. The above example would be 26MB used.
  • File descriptors (FDS) F:4
    Shows the number of file descriptors used by the osqueryi process during query execution.
  • Duration D:3.2
    The number of seconds elapsed whilst running the query.

Osquery osquery_schedule

Once you’ve got all your queries written a tuned locally. You’ll obviously want to find out how they perform in the real world across your fleet. Fortunately, when running Osquery in a remote distributed fashion you can use the osquery_schedule table to ascertain performance metrics on scheduled queries.

SELECT * FROM osquery_schedule WHERE last_executed > 0;
Querying Osquery’s osquery_schedule table to fetch performance information related to our scheduled queries.
  • query
    The exact query which was run e.g. SELECT * FROM users;
  • interval
    The number of seconds this query is set to run. Please note this is not guaranteed as per the Osquery scheduler.
  • executions
    This is the number of times the query was executed
  • last_executed
    UNIX time stamp in seconds of the last completed execution
  • denylisted
    If the query keeps hitting the limits imposed by the Osquery watchdog. The query will be prevented from running in the future. So the users machine isn’t impacted. The denylisted result will either by a 1 if the query has been denylisted. Or 0 if the query is still able to run.
  • output_size
    Total number of bytes generated by the query.
  • wall_time
    Total wall time spent executing. This is the elapsed time, including time spent waiting for its turn on the CPU. Note: that this is that total amount of time and not the last result.
  • user_time
    Total user time spent executing in user land. Note: that this is that total amount of time and not the last result.
  • system_time
    Total system time spent executing by the kernel. Note: that this is that total amount of time and not the last result.
  • average_memory
    This is the average private memory (resident_size) left after executing. Divided by the number of executions.
SELECT name, query, interval, executions, last_executed, denylisted, output_size,
IFNULL(system_time / executions, 0) AS average_system_time,
IFNULL(user_time / executions, 0) AS average_user_time,
IFNULL(wall_time / executions, 0) AS average_wall_time,
ROUND((average_memory * '10e-7'), 2) AS average_memory_mb
FROM osquery_schedule;

Optimizing queries

Phew, right. That was the last thing on our list as far as what we can do inside Osquery to optimize and monitor our queries.

Table delay

When Osquery performs joins between tables. You can actually add a small wait before those tables are loaded. A 200 microsecond delay will trade about 20% additional time for a reduced 5% CPU utilization.

File Hashing

If you’re hashing files within Osquery and doing it constantly. Maybe for hashing the binaries of running processes. Osquery will cache already processed files for you. The --hash_cache_max=500 flag will increase the resident in memory cache. Old items will be evicted.However, if you’re running the same query over and over this can greatly improve the query’s performance.

Its all over!

Hopefully that’s given you a quick dive into performance monitoring with Osquery. However, that’s all for now. Please feel free to get in touch if you have any questions.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Zercurity

Zercurity

Real-time security and compliance delivered.