
scanf: Here be dragons.
gets: The function that cannot be used safely.
fgets: A partial solution to the get-a-line problem.
fgetword: A word at a time, and no word too long!
fgetline: Read a line as long as your arm - or much, much longer.
related links: Other pages dealing with this subject.
One of the first challenges facing the neophyte C programmer
is that of obtaining data, either from the user, or from a
file. It is a matter of some concern (to me, at least) that
so many C teachers try to satisfy the student's need to get
this data by introducing the extraordinarily complex and subtle
scanf function and its close relative,
fscanf.
scanf function
Here is a typical "student" program that uses
the scanf function to read from the standard
input device. THIS CODE IS BUGGY! DO NOT USE IT!
#include <stdio.h>

int main(void)
{
  char *buf;
  scanf("%s", &buf); /* BUGS! */
  printf("Hello %s", buf);
  return 0;
}
This code has several problems. Firstly, it mistakenly passes the
address of the pointer to scanf. When this
program is run, the result is garbage. Hardly surprising, really.
Let's fix that (BUT THE PROGRAM IS STILL BROKEN!)...
#include <stdio.h>

int main(void)
{
  char *buf;
  scanf("%s", buf); /* BUGS! */
  printf("Hello %s", buf);
  return 0;
}
This code is still very poor. The programmer has made the
rather common mistake of thinking that char *
is the C way of spelling "string" - which is not true.
Unfortunately, it is entirely possible that the program will
"work" as the programmer expected it to.
When I compile this on my system and then run it, here is the output I get:
$> ./scanf
Richard
Hello Richard$>
Alas, it "works". And yet it is still broken. About
the worst problem the newbie C programmer might spot with the
result is that it fails to put the shell prompt on a new line.
He can, of course, fix that by putting a '\n' character into the
printf format string.
Having put that into the code, I compiled it using much stricter diagnostic checking, and here is the output my compiler provided:
$> gcc -W -Wall -ansi -pedantic -O2 -o scanf scanf.c
scanf.c: In function `main':
scanf.c:5: warning: `buf' might be used uninitialized in this function
The real problem here is that we are telling scanf
to store a string at the address provided, but we haven't actually
allocated any storage in which to store that string. One easy fix
for this is to use an array (BUT THE PROGRAM IS STILL BROKEN!):
#include <stdio.h>

int main(void)
{
  char buf[64];
  scanf("%s", buf); /* BUGS! */
  printf("Hello %s\n", buf);
  return 0;
}
This is getting a bit better, but there are still some
problems. Firstly, we're not checking whether the
scanf call succeeded. That's no big deal
if it did succeed, but can be a very big deal
indeed if it didn't.
Secondly, we can't store any word with a length greater than or equal to the size of the array.
Thirdly, we can't guarantee that the user won't try to
exceed that limit - and scanf will do nothing
to stop the user from running straight over the end of the
array, either accidentally or maliciously. (In case you were
wondering, this is one of the ways you can make your code
vulnerable to a buffer overflow attack.)
This next version of the program fixes two of these problems - the first and the last:
#include <stdio.h>

int main(void)
{
  char buf[64];
  if(1 == scanf("%63s", buf))
  {
    printf("Hello %s\n", buf);
  }
  else
  {
    fprintf(stderr, "Input error.\n");
  }
  return 0;
}
This code is much better. Put in an explanatory comment or two, and I'd award you nine out of ten (if you were in your first week as a C programmer!).
But it still suffers from one problem - what if the input stream contains a word that is longer than the size of the array? We can stop the outsized word from corrupting memory easily enough; that's what the 63 is doing in "%63s". But we're still losing data. What we'd really like is to get the whole word in one fell swoop. Later on in this article, we'll design a solution to this problem.
If anyone has written an article on the robust use of
scanf, in all its hideous complexity,
and would like me to link to that article here, please
get in touch.
Actually, there's another problem that I haven't mentioned
yet - what if we want to get a whole line, rather than a
single word? Well, the C library provides a function to get
an entire line of input from the standard input device.
Unfortunately, this function, gets, is deeply flawed.
gets function
The gets function belongs to a bygone age, when
users behaved themselves, buffer overrun attacks were unheard
of, and programmers were less aware of the importance of robust
code. (I'm not sure whether such an age ever existed, but that's
another discussion for another time, if ever.)
gets takes as much data as it can find in the
standard input stream, up to either the end of the stream or
a newline character if that is encountered first. It reads and
discards the newline character, and stores all the characters
before that in consecutive memory locations, beginning at the
address you supply as an argument. If this turns out to be more
characters than you had memory for, well, that's your tough luck.
It is possible for an unscrupulous user of your code to exploit the
gets function to launch a buffer overrun attack.
I'm not going to go into the details. You can find them easily enough on
the Web. But I will just mention that this is not a theoretical problem. Ever
since the infamous Internet Worm of 1988, malicious programmers have been
exploiting programs that use gets. The lessons are clear:
(1) protect your buffer! (2) never use gets because it makes
(1) impossible.
I will not demonstrate the use of the gets
function here. Why tempt fate?
fgets function
What we could really do with is a function like gets
but which accepts a parameter that specifies the size of the buffer, and
which guarantees not to write more than that many characters into your
buffer. There is such a function, of course - a standard C library
function named fgets. This function not only accepts a buffer
size parameter, but also a stream parameter - so you can use it for fetching
data from any text stream open for input! This is very useful indeed.
(If you just want data from the standard input device, use the standard
input stream pointer stdin as the third argument to
fgets.)
What happens if fgets encounters a line that is
longer than the buffer we provide? Well, it's very
simple - the function stops reading bytes from the input stream
as soon as the buffer is full. At that point, it writes a
'\0' character as the last character in the buffer
(no, it's all right, fgets knows to read only
n - 1 bytes of data, to leave room for the '\0').
Can we detect whether a complete line was read? Yes. If the character just
before the null terminating character of the populated buffer is
not a '\n' character, then we know that either the line
was too long for the buffer or the input ended without a final newline.
To get the rest of a long line, we can call fgets
again when we're ready for the rest of the data. (Yes, I agree that this
isn't exactly satisfactory. Have patience, and we'll get there in the end.)
If we know in advance the longest line we expect to encounter, we can use
fgets for very easy line-at-a-time processing. But there's a
hitch. Typically, we don't want the '\n' character on the end
of the line. Having said that, we do want to know that it's there (to assure
ourselves that a complete line has been read). This leads to some fairly
boiler-plate C code - some "spot the newline" detection code, and
a little helper function to replace a '\n' character with a
'\0' character.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLINE 128

int chomp(char *s)
{
  int chomped = 0;
  char *p = strchr(s, '\n');
  if(p != NULL)
  {
    *p = '\0';
    chomped = 1;
  }
  return chomped;
}

int main(void)
{
  char buf[MAXLINE] = {0};
  int rc = 0;
  while(0 == rc && fgets(buf, sizeof buf, stdin) != NULL)
  {
    if(chomp(buf))
    {
      printf("Got the line [%s]\n", buf);
    }
    else
    {
      printf("Line too long! Aborting.\n");
      rc = EXIT_FAILURE;
    }
  }
  return rc;
}
As you can see, this example program quits early if it encounters data it can't handle. That's not really good enough for me, and I don't suppose it's good enough for you, either. So - what are we going to do about it? More to the point, why hasn't ISO already done something about it? Why isn't there already a function to get a complete line from an input stream, irrespective of its length, in a memory-safe way?
Let's deal with the second point first (I only do this to annoy!). To get a complete line without knowing its length in advance, we have to find a way to get a buffer large enough, without going to the extent of specifying a fixed-size buffer of ludicrous proportions just on the off-chance that we might meet a very long line.
To achieve this, we can make use of dynamic memory allocation. It is certainly possible for a data acquisition function to resize a dynamic buffer as it goes along, always making sure that the buffer is large enough to accommodate the data it reads from the input stream.
Once we make up our minds to do this, however, we have to consider the function's interface with the caller. Do we simply return a pointer to a freshly-allocated buffer? This is certainly very tempting, but what if we call the function in a loop (which, typically, we will want to do)? To prevent memory leakage, the user would have to either copy the pointer safely away or release the memory before calling the function again.
Another possibility is for the function to maintain a buffer internally; this way, the calling code wouldn't have to worry about memory management - but on the other hand, how would we free the buffer when the program has finished its data acquisition? A special parameter? Maybe.
A third possibility is to pass the address of a pointer to a reallocable buffer into the function. This is quite a nice idea, because it means the function can re-use the buffer on consecutive calls, but it does mean that we need to keep track of the buffer size.
All these design decisions have a certain amount of merit, and there's no single, obvious, right answer. And that is why (or at least, I hope it's why) the standard C library doesn't include a function of this kind; whichever interface they chose, there would definitely be some people who thought it was the wrong choice! Also, of course, it's perfectly possible to implement a function of this kind using existing ISO C functions, so it's not unreasonable to leave such design choices to the individual programmer.
I have written two functions of this kind - one for reading a complete word (however large) from an input stream, and another for reading a complete line, again of arbitrary size, from the stream.
Let's look, first of all, at the function for getting a complete word at a time:
fgetword function
Here's the prototype for fgetword:
int fgetword(char **word,
             size_t *size,
             const char *delimiters,
             size_t maxrecsize,
             FILE *fp,
             unsigned int flags);
The source code is in fgetword.c
- you will also need fgetdata.h,
which you should place in your include path.
The fgetword function reads a word at a time. A word
is defined as a sequence of characters that does not include any of the
characters in the delimiters argument you supply to the
function.
The function uses a reallocable buffer. You can get yourself a buffer
and pass it in if you wish, but there is no need, since fgetword
is perfectly capable of providing one for you. To take advantage of this,
you need only pass in the address of a char * that points to
NULL. If you do decide to use a buffer you allocated yourself,
you must know how big that buffer is, and you must tell
fgetword. You do this by populating a size_t
object with the exact capacity of the buffer and passing its address in as
the second argument.
Here is a fairly typical way to use fgetword:
#include <stdio.h>
#include <stdlib.h>
#include "fgetdata.h"

int main(void)
{
  char *delimiters = " \t\r\n\f\v\a\b\\?'\"!%^&"
                     "*()=+/<>,.|[]{}#~";
  char *line = NULL;
  size_t size = 0;

  while(0 == fgetword(&line,
                      &size,
                      delimiters,
                      (size_t)-1,
                      stdin,
                      0))
  {
    printf("Word found: [%s]\n", line);
  }
  free(line);
  return 0;
}
As you can see, it is necessary to pass the address
of the buffer pointer, because the fgetword
function may need to change the location of the buffer
(and sometimes does). This is why it is essential that you
don't use an auto or static buffer: the buffer must be one
that realloc can legitimately resize.
Note that the size information you give to fgetword
is updated within the routine, so you can find out how much memory
is tied up in the buffer. If you think it's too much, by all means
reduce it yourself using realloc or, alternatively,
pass the FGDATA_REDUCE flag as the last argument to
the function. This will cause fgetword to reduce the
buffer size to the minimum necessary to handle the current word.
Note that you have absolute control over the buffer size, via the
fourth parameter. If you don't want to limit the buffer size, set
this to (size_t)-1. If you want the buffer size to be
limited, this parameter is your chance to be strict. :-)
fgetline function
Here's the prototype for fgetline:
int fgetline(char **line,
             size_t *size,
             size_t maxrecsize,
             FILE *fp,
             unsigned int flags);
The source code is in fgetline.c
- you will also need fgetdata.h,
which you should place in your include path.
The fgetline function reads a line at a time. It
is effectively equivalent in most respects to
fgetword(line, size, "\n", maxrecsize, fp, flags);
- the only difference being that whereas fgetword
treats an empty string as equivalent to end-of-file (think about
it!), the fgetline function will (correctly) retain
blank lines. This means it may be necessary to test line[0]
against '\0' before processing a line, depending on the
needs of your application.
related links
Chuck Falconer's ggets function.
Morris Dovey's getsm function.
Eric Sosman's getline function.