
Grep asks: "Is this pattern here?"

Awk asks: "What's in column 7? How many times did each IP address appear? What's the average response time for 500 errors?"

That's the difference. Grep finds needles. Awk counts them, sorts them by size, and tells you which haystack they came from.

Awk Sees Structure

When awk reads a line, it automatically splits it into fields. By default, any run of whitespace marks a field boundary. Suddenly a log entry isn't a string—it's a data structure.

awk '{print $1}' /var/log/syslog

This prints the first field from every line. For traditional syslog, that's the month; $2 is the day, $3 is the time, and so on. $0 is the entire line.
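
To make those positions concrete, here is a hypothetical traditional-syslog line with its whitespace-separated fields labeled (newer systems may log ISO timestamps instead, so check a few real lines before trusting field numbers):

Jan 15 10:23:45 webserver sshd[1234]: Failed password for root
$1  $2 $3       $4        $5          $6     $7      $8  $9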

Print multiple fields:

awk '{print $1, $2, $3}' /var/log/syslog

Commas add spaces between fields. Without commas, fields concatenate directly.
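
A quick illustration of the difference (output shown as comments):

awk '{print $1 $2}' /var/log/syslog    # Jan15  (concatenated)
awk '{print $1, $2}' /var/log/syslog   # Jan 15 (separated)

The comma inserts the output field separator, OFS, which defaults to a single space.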

Custom Delimiters

Not all logs use whitespace. Specify a different field separator with -F:

awk -F':' '{print $1}' /etc/passwd

This splits on colons. For Apache access logs:

awk '{print $1, $7}' /var/log/apache2/access.log

Field 1 is the IP address. Field 7 is the requested URL. The log format determines which field holds what.
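
For reference, a made-up line in the combined format, with the fields this article uses labeled underneath (your LogFormat directive may differ):

203.0.113.9 - - [15/Jan/2024:10:23:45 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "curl/8.0"
$1              $4                    $5     $6   $7          $8       $9  $10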

Pattern Matching

Awk can filter lines before processing them:

awk '/error/ {print $1, $2, $5}' /var/log/syslog

This finds lines containing "error" and prints specific fields from them. The pattern goes in slashes before the action block.

Filter by field values:

awk '$9 == 404 {print $7}' /var/log/apache2/access.log

This prints URLs that returned 404 errors. Field 9 is the status code in Apache's combined log format.
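
To match a whole class of codes instead of a single value, test the field against a regex (the ~ operator is covered below):

awk '$9 ~ /^5/ {print $7}' /var/log/apache2/access.log

This prints the URL for every 5xx response.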

Associative Arrays: Awk's Superpower

This is where awk leaves grep behind entirely.

Count requests per IP address:

awk '{count[$1]++} END {for (ip in count) print ip, count[ip]}' /var/log/apache2/access.log

The array count uses IP addresses as keys. Each time an IP appears, its count increments. The END block runs after all lines are processed, printing the totals.

Sort by frequency:

awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' /var/log/apache2/access.log | sort -nr | head -20

Putting counts first lets sort -nr show the highest values. This answers "who's hitting my server the hardest?" in one line.

Status code distribution:

awk '{status[$9]++} END {for (code in status) print code, status[code]}' /var/log/apache2/access.log

This shows how many requests returned each status code. Suddenly you can see your error rate at a glance.

Bandwidth per IP:

awk '{bandwidth[$1] += $10} END {for (ip in bandwidth) print ip, bandwidth[ip]}' /var/log/apache2/access.log

Field 10 is bytes transferred. This sums bandwidth consumption by IP address—useful for finding who's downloading your entire site.
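
Raw byte totals are hard to read at a glance. One possible refinement, converting to megabytes (combined logs record "-" for zero bytes, which awk coerces to 0 in arithmetic):

awk '{bandwidth[$1] += $10} END {for (ip in bandwidth) printf "%-15s %10.1f MB\n", ip, bandwidth[ip]/1048576}' /var/log/apache2/access.log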

Counting and Calculating

Awk maintains variables across lines:

awk '/error/ {count++} END {print count}' /var/log/syslog

This counts error lines. The variable persists, incrementing with each match.
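
Uninitialized awk variables start at 0 (or the empty string), which is why count++ needs no setup. The one catch: if nothing matches, the bare print emits a blank line. A defensive variant that also counts warnings:

awk '/error/ {e++} /warning/ {w++} END {print "errors:", e+0, "warnings:", w+0}' /var/log/syslog

Adding +0 forces numeric context, printing 0 instead of nothing when there were no matches.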

Sum a field:

awk '{sum += $10} END {print sum}' /var/log/apache2/access.log

This totals bytes transferred—your bandwidth usage in one number.

Calculate averages:

awk '{sum += $10; count++} END {print sum/count}' /var/log/apache2/access.log

Average bytes per request. Combine operations with semicolons.
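
On an empty file this divides by zero and awk aborts. A guarded version of the same average:

awk '{sum += $10; count++} END {if (count > 0) print sum/count; else print "no requests"}' /var/log/apache2/access.log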

Error rate as a percentage:

awk '{total++; if ($9 >= 400) errors++} END {printf "Error rate: %.2f%%\n", (errors/total)*100}' /var/log/apache2/access.log

This computes what percentage of requests resulted in 4xx or 5xx errors.

Working with Timestamps

Filter by date:

awk '$1 == "Jan" && $2 == 15 {print $0}' /var/log/syslog

This shows only January 15th entries.

Filter by hour using regex matching:

awk '$3 ~ /^10:/ {print $0}' /var/log/syslog

The tilde (~) tests if a field matches a pattern. This finds entries from 10:00-10:59.

Find the busiest hour:

awk '{hour=substr($3,1,2); count[hour]++} END {for (h in count) print h, count[h]}' /var/log/syslog | sort -n

The substr function extracts characters—here, the first two characters of the timestamp (the hour).

String Functions

Extract substrings:

awk '{print substr($1,1,10)}' logfile

First 10 characters of field 1.

Filter by string length:

awk '{if (length($7) > 100) print $7}' /var/log/apache2/access.log

URLs longer than 100 characters—potentially suspicious requests.

Split fields further:

awk '{split($4,a,":"); hour=a[2]; count[hour]++} END {for (h in count) print h, count[h]}' /var/log/apache2/access.log

The split function divides a field by a delimiter into an array.
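
A self-contained illustration of how split behaves (it returns the number of pieces it produced):

awk 'BEGIN {n = split("15/Jan/2024:10:23:45", a, ":"); print n, a[2]}'

This prints "4 10": four colon-separated pieces, with the hour landing in a[2].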

Formatted Output

Use printf for clean reports:

awk '{printf "%-15s %10d\n", $1, $10}' /var/log/apache2/access.log

IP addresses left-aligned in 15 characters, byte counts right-aligned in 10.

With headers:

awk 'BEGIN {print "IP Address       Count"} {count[$1]++} END {for (ip in count) print ip, count[ip]}' access.log

BEGIN runs before any lines are processed. END runs after all lines are processed.
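
One caveat: the header above is printed with plain print, so it won't necessarily line up with the data rows. A sketch that uses printf in every block for alignment (the column widths here are arbitrary choices):

awk 'BEGIN {printf "%-16s %s\n", "IP Address", "Count"} {count[$1]++} END {for (ip in count) printf "%-16s %d\n", ip, count[ip]}' access.log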

Combining Conditions

awk '$9 == 404 && $10 > 1000 {print $7, $10}' /var/log/apache2/access.log

404 errors with response sizes over 1000 bytes—unusual, since 404 pages should be small.

awk '$9 >= 500 || $9 == 404 {print $9, $7}' /var/log/apache2/access.log

All server errors or 404s.

Saving Complex Programs

For serious analysis, save awk programs to files:

cat > analyze.awk << 'EOF'
{
    count[$1]++
    bandwidth[$1] += $10
}
END {
    for (ip in count) {
        printf "%s: %d requests, %d bytes\n", ip, count[ip], bandwidth[ip]
    }
}
EOF

awk -f analyze.awk /var/log/apache2/access.log

This makes complex logic readable and reusable.

The Mental Model

Awk transforms how you see text files. A log line stops being a string you search through and becomes a row in a database you can query.

The field references ($1, $2, $7) are your column names. Pattern matching is your WHERE clause. Associative arrays are GROUP BY. The END block computes your aggregates.

Once you see logs this way, questions that seemed impossible become one-liners:

  • "Which IPs are causing the most 500 errors?" Group by IP, filter by status, count.
  • "What's our 95th percentile response time?" Collect values, sort, pick the right index.
  • "When did traffic spike?" Group by minute, find the maximum.

Grep finds what you're looking for. Awk answers questions you didn't know you could ask.
