Sunday, December 24, 2006

IrCOMM2k

http://www.ircomm2k.de/English/index.html

Virtual Infrared COM Port for Windows 2000/XP

IrCOMM2k is a driver for Windows 2000 and XP. It emulates a serial port which can be used to exchange data with mobile devices. For example, some cellular phones are able to act as modems or fax devices. PDAs with infrared interface can be synchronized with the PC.



Everyone is free to use and redistribute IrCOMM2k. Furthermore, the source code is available. IrCOMM2k is an Open Source project according to the terms of the GPL.

 

An Introduction to the Text Editor "VI"

vi(m): most Unix systems (including Linux) have the vi(m) editor built in. If you are going to become a Linux system administrator, or will spend long stretches working in a Linux environment, you had better learn vi(m), because sooner or later you will need it, above all when the system is in trouble. For a newcomer the first session with vi(m) is quite painful, even hateful, because it works completely unlike ordinary editors. Once you learn to drive it, though, you will not want to put it down, because it is simply that powerful. That is vi(m): loved and hated at once.

In fact the vi on Linux systems is really vim. vi and vim are used in much the same way: vim is an enhanced version of vi, so vim has every feature vi has, adds many that vi lacks, and is easier to use than vi.

7.4.1 A brief look at the three modes of vi(m)
Normal mode
In normal mode you can only move the cursor around the screen and mark, delete, move, or copy characters and blocks of text. From normal mode you can enter insert mode or command-line mode.
Insert mode
Only in insert mode can text be entered; press ESC to return to normal mode.
Command-line mode
Used to write the file out or leave the editor, and also to configure the editing environment, such as searching for strings or listing line numbers.
7.4.2 Basic vi(m) operations
Starting vi
Typing vi and a file name at the system prompt enters vi's full-screen editing view, in normal mode. Entering text in insert mode: from normal mode, press one of the three keys 'i', 'a', or 'o' to enter insert mode.
"i" insert - insert text before the current cursor position
"a" append - start entering text after the current cursor position
"o" open - open a new line below the cursor and start entering text from the beginning of that line
Leaving vi and saving
From normal mode, press ':' to enter command-line mode:
:w filename ( write to the named file )
:wq ( write the file and leave vi )
:q! ( force quit, discarding the edited file )
:n [file] - load the next file
:n1,n2d - delete the text from line n1 through line n2
:n1,n2s/pattern1/pattern2/g - on lines n1 through n2, change the string pattern1 into pattern2; without g only the first match on each line is replaced, with g every match is replaced
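For example, assuming a file in which every "foo" on lines 1 through 10 should become "bar":

:1,10s/foo/bar/g ( replace every foo on lines 1 to 10 with bar )
:1,10s/foo/bar/ ( replace only the first foo on each of those lines )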
7.4.3 A brief look at normal-mode keys
Moving the cursor
h, j, k, l: move the cursor one position left, down, up, and right respectively
Ctrl+b : back one page
Ctrl+f : forward one page
G : go to the end of the file
w : move to the beginning of the next word
b : jump to the beginning of the previous word
Deleting
x: delete the character at the cursor
#x: e.g. 5x deletes 5 characters
dd: delete the line the cursor is on
#dd: e.g. 5dd deletes 5 lines starting from the cursor line
Changing
cw: change from the cursor to the end of the current word
c#w: e.g. c3w changes 3 words
Replacing
r: replace the character at the cursor
R: replace characters until ESC is pressed
Copying
yw: copy from the cursor to the end of the word into the buffer
P: paste the buffer contents
yy: copy the cursor line into the buffer
#yy: e.g. 5yy copies the 5 lines starting at the cursor into the buffer
Undo
u: undo the last action
Showing the current line
Ctrl+g: show the current line number and related file information
7.4.4 A brief look at command-line-mode commands
Note: press ESC first to make sure you are in normal mode; then pressing one of the three keys ':', '/' or '?' enters command-line mode
Listing line numbers
 :set nu
Searching for a string
/word ( search from the top of the file downward ); press n to keep searching forward
?word ( search from the end of the file upward ); press N to keep searching backward
Jumping to a line
 :100 - jump to line 100

Thursday, December 21, 2006

"A simple performance comparison of several Apache CGI modules" reposted without attribution

Today, while searching the web for information about Lua, I found the post "A simple performance comparison of several Apache CGI modules" from my own blog at
http://www.mylinux.com.cn/newsTextAction.do?id=15,045. I took a look: there wasn't even a reposting notice. Some editor!

Saturday, December 16, 2006

memcached

memcached homepage http://www.danga.com/memcached/
memcached for Win32 http://jehiah.cz/projects/memcached-win32/

Client APIS http://www.danga.com/memcached/apis.bml

Protocol
--------

Clients of memcached communicate with the server through TCP connections.
(A UDP interface is also available; details are below under "UDP
protocol.") A given running memcached server listens on some
(configurable) port; clients connect to that port, send commands to
the server, read responses, and eventually close the connection.

There is no need to send any command to end the session. A client may
just close the connection at any moment it no longer needs it. Note,
however, that clients are encouraged to cache their connections rather
than reopen them every time they need to store or retrieve data. This
is because memcached is especially designed to work very efficiently
with a very large number (many hundreds, more than a thousand if
necessary) of open connections. Caching connections will eliminate the
overhead associated with establishing a TCP connection (the overhead
of preparing for a new connection on the server side is insignificant
compared to this).

There are two kinds of data sent in the memcache protocol: text lines
and unstructured data. Text lines are used for commands from clients
and responses from servers. Unstructured data is sent when a client
wants to store or retrieve data. The server will transmit back
unstructured data in exactly the same way it received it, as a byte
stream. The server doesn't care about byte order issues in
unstructured data and isn't aware of them. There are no limitations on
characters that may appear in unstructured data; however, the reader
of such data (either a client or a server) will always know, from a
preceding text line, the exact length of the data block being
transmitted.

Text lines are always terminated by \r\n. Unstructured data is _also_
terminated by \r\n, even though \r, \n or any other 8-bit characters
may also appear inside the data. Therefore, when a client retrieves
data from a server, it must use the length of the data block (which it
will be provided with) to determine where the data block ends, and not
the fact that \r\n follows the end of the data block, even though it
does.

Keys
----

Data stored by memcached is identified with the help of a key. A key
is a text string which should uniquely identify the data for clients
that are interested in storing and retrieving it. Currently the
length limit of a key is set at 250 characters (of course, normally
clients wouldn't need to use such long keys); the key must not include
control characters or whitespace.

Commands
--------

There are three types of commands.

Storage commands (there are three: "set", "add" and "replace") ask the
server to store some data identified by a key. The client sends a
command line, and then a data block; after that the client expects one
line of response, which will indicate success or failure.

Retrieval commands (there is only one: "get") ask the server to
retrieve data corresponding to a set of keys (one or more keys in one
request). The client sends a command line, which includes all the
requested keys; after that for each item the server finds it sends to
the client one response line with information about the item, and one
data block with the item's data; this continues until the server
finishes with the "END" response line.

All other commands don't involve unstructured data. In all of them,
the client sends one command line, and expects (depending on the
command) either one line of response, or several lines of response
ending with "END" on the last line.

A command line always starts with the name of the command, followed by
parameters (if any) delimited by whitespace. Command names are
lower-case and are case-sensitive.

Expiration times
----------------

Some commands involve a client sending some kind of expiration time
(relative to an item or to an operation requested by the client) to
the server. In all such cases, the actual value sent may either be
Unix time (number of seconds since January 1, 1970, as a 32-bit
value), or a number of seconds starting from current time. In the
latter case, this number of seconds may not exceed 60*60*24*30 (number
of seconds in 30 days); if the number sent by a client is larger than
that, the server will consider it to be real Unix time value rather
than an offset from current time.
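As a worked example: 60*60*24*30 = 2592000. An expiration value of 86400 therefore means "one day from now", while a value such as 1160000000, being larger than 2592000, is taken as an absolute Unix timestamp.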


Error strings
-------------

Each command sent by a client may be answered with an error string
from the server. These error strings come in three types:

- "ERROR\r\n"

means the client sent a nonexistent command name.

- "CLIENT_ERROR \r\n"

means some sort of client error in the input line, i.e. the input
doesn't conform to the protocol in some way. is a
human-readable error string.

- "SERVER_ERROR \r\n"

means some sort of server error prevents the server from carrying
out the command. is a human-readable error string. In cases
of severe server errors, which make it impossible to continue
serving the client (this shouldn't normally happen), the server will
close the connection after sending the error line. This is the only
case in which the server closes a connection to a client.


In the descriptions of individual commands below, these error lines
are not again specifically mentioned, but clients must allow for their
possibility.


Storage commands
----------------

First, the client sends a command line which looks like this:

<command name> <key> <flags> <exptime> <bytes>\r\n

- is "set", "add" or "replace"

"set" means "store this data".

"add" means "store this data, but only if the server *doesn't* already
hold data for this key".

"replace" means "store this data, but only if the server *does*
already hold data for this key".

- <key> is the key under which the client asks to store the data

- <flags> is an arbitrary 16-bit unsigned integer (written out in
decimal) that the server stores along with the data and sends back
when the item is retrieved. Clients may use this as a bit field to
store data-specific information; this field is opaque to the server.

- <exptime> is expiration time. If it's 0, the item never expires
(although it may be deleted from the cache to make place for other
items). If it's non-zero (either Unix time or offset in seconds from
current time), it is guaranteed that clients will not be able to
retrieve this item after the expiration time arrives (measured by
server time).

- <bytes> is the number of bytes in the data block to follow, *not*
including the delimiting \r\n. <bytes> may be zero (in which case
it's followed by an empty data block).

After this line, the client sends the data block:

<data block>\r\n

- <data block> is a chunk of arbitrary 8-bit data of length <bytes>
from the previous line.

After sending the command line and the data block, the client awaits
the reply, which may be:

- "STORED\r\n", to indicate success.

- "NOT_STORED\r\n" to indicate the data was not stored, but not
because of an error. This normally means either that the
condition for an "add" or a "replace" command wasn't met, or that the
item is in a delete queue (see the "delete" command below).
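For instance, a hypothetical session storing the five-byte value "hello" under the key "foo" with flags 0 and no expiration (every line terminated by \r\n; server replies shown after the client's lines):

set foo 0 0 5
hello
STORED

add foo 0 0 3
xyz
NOT_STORED

The "add" fails with NOT_STORED because the server already holds data for "foo".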


Retrieval command:
------------------

The retrieval command looks like this:

get <key>*\r\n

- <key>* means one or more key strings separated by whitespace.

After this command, the client expects zero or more items, each of
which is received as a text line followed by a data block. After all
the items have been transmitted, the server sends the string

"END\r\n"

to indicate the end of response.

Each item sent by the server looks like this:

VALUE <key> <flags> <bytes>\r\n
<data block>\r\n

- <key> is the key for the item being sent

- <flags> is the flags value set by the storage command

- <bytes> is the length of the data block to follow, *not* including
its delimiting \r\n

- <data block> is the data for this item.

If some of the keys appearing in a retrieval request are not sent back
by the server in the item list, this means that the server does not
hold items with such keys (because they were never stored, or stored
but deleted to make space for more items, or expired, or explicitly
deleted by a client).
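Continuing the hypothetical session above, retrieving "foo" together with a key "baz" that was never stored:

get foo baz
VALUE foo 0 5
hello
END

No line is sent for "baz"; its absence before "END" is the only indication that the server does not hold it.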



Deletion
--------

The command "delete" allows for explicit deletion of items:

delete <key> [<time>]\r\n
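A hypothetical exchange, omitting the optional <time> argument (the server answers "DELETED\r\n" on success, or "NOT_FOUND\r\n" if no item with that key exists):

delete foo
DELETED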

Wednesday, November 22, 2006

http://pl.atyp.us/content/tech/servers.html

Introduction

The purpose of this document is to share some ideas that I've developed over the years about how to develop a certain kind of application for which the term "server" is only a weak approximation. More accurately, I'll be writing about a broad class of programs that are designed to handle very large numbers of discrete messages or requests per second. Network servers most commonly fit this definition, but not all programs that do are really servers in any sense of the word. For the sake of simplicity, though, and because "High-Performance Request-Handling Programs" is a really lousy title, we'll just say "server" and be done with it.

I will not be writing about "mildly parallel" applications, even though multitasking within a single program is now commonplace. The browser you're using to read this probably does some things in parallel, but such low levels of parallelism really don't introduce many interesting challenges. The interesting challenges occur when the request-handling infrastructure itself is the limiting factor on overall performance, so that improving the infrastructure actually improves performance. That's not often the case for a browser running on a gigahertz processor with a gigabyte of memory doing six simultaneous downloads over a DSL line. The focus here is not on applications that sip through a straw but on those that drink from a firehose, on the very edge of hardware capabilities where how you do it really does matter.

Some people will inevitably take issue with some of my comments and suggestions, or think they have an even better way. Fine. I'm not trying to be the Voice of God here; these are just methods that I've found to work for me, not only in terms of their effects on performance but also in terms of their effects on the difficulty of debugging or extending code later. Your mileage may vary. If something else works better for you that's great, but be warned that almost everything I suggest here exists as an alternative to something else that I tried once only to be disgusted or horrified by the results. Your pet idea might very well feature prominently in one of these stories, and innocent readers might be bored to death if you encourage me to start telling them. You wouldn't want to hurt them, would you?

The rest of this article is going to be centered around what I'll call the Four Horsemen of Poor Performance:

  1. Data copies
  2. Context switches
  3. Memory allocation
  4. Lock contention

There will also be a catch-all section at the end, but these are the biggest performance-killers. If you can handle most requests without copying data, without a context switch, without going through the memory allocator and without contending for locks, you'll have a server that performs well even if it gets some of the minor parts wrong.

Data Copies

This could be a very short section, for one very simple reason: most people have learned this lesson already. Everybody knows data copies are bad; it's obvious, right? Well, actually, it probably only seems obvious because you learned it very early in your computing career, and that only happened because somebody started putting out the word decades ago. I know that's true for me, but I digress. Nowadays it's covered in every school curriculum and in every informal how-to. Even the marketing types have figured out that "zero copy" is a good buzzword.

Despite the after-the-fact obviousness of copies being bad, though, there still seem to be nuances that people miss. The most important of these is that data copies are often hidden and disguised. Do you really know whether any code you call in drivers or libraries does data copies? It's probably more than you think. Guess what "Programmed I/O" on a PC refers to. An example of a copy that's disguised rather than hidden is a hash function, which has all the memory-access cost of a copy and also involves more computation. Once it's pointed out that hashing is effectively "copying plus" it seems obvious that it should be avoided, but I know at least one group of brilliant people who had to figure it out the hard way. If you really want to get rid of data copies, either because they really are hurting performance or because you want to put "zero-copy operation" on your hacker-conference slides, you'll need to track down a lot of things that really are data copies but don't advertise themselves as such.

The tried and true method for avoiding data copies is to use indirection, and pass buffer descriptors (or chains of buffer descriptors) around instead of mere buffer pointers. Each descriptor typically consists of the following:

  • A pointer and length for the whole buffer.
  • A pointer and length, or offset and length, for the part of the buffer that's actually filled.
  • Forward and back pointers to other buffer descriptors in a list.
  • A reference count.

Now, instead of copying a piece of data to make sure it stays in memory, code can simply increment a reference count on the appropriate buffer descriptor. This can work extremely well under some conditions, including the way that a typical network protocol stack operates, but it can also become a really big headache. Generally speaking, it's easy to add buffers at the beginning or end of a chain, to add references to whole buffers, and to deallocate a whole chain at once. Adding in the middle, deallocating piece by piece, or referring to partial buffers will each make life increasingly difficult. Trying to split or combine buffers will simply drive you insane.
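As a concrete illustration, a minimal buffer-descriptor type might look like the following C sketch; the field and function names are my own, not taken from any particular stack, and the reference counting is left non-atomic for brevity:

#include <stdlib.h>

struct buf_desc {
    char            *base;    /* pointer to the whole buffer       */
    size_t           cap;     /* length of the whole buffer        */
    size_t           off;     /* offset of the filled part         */
    size_t           len;     /* length of the filled part         */
    struct buf_desc *next;    /* forward link in the chain         */
    struct buf_desc *prev;    /* backward link in the chain        */
    int              refcnt;  /* reference count                   */
};

/* Instead of copying the data, a consumer just takes a reference... */
static void buf_ref(struct buf_desc *b) { b->refcnt++; }

/* ...and drops it when done; the last holder frees the buffer.
 * A real multi-threaded implementation would use atomic ops here. */
static void buf_unref(struct buf_desc *b)
{
    if (--b->refcnt == 0) {
        free(b->base);
        free(b);
    }
}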

I don't actually recommend using this approach for everything, though. Why not? Because it gets to be a huge pain when you have to walk through descriptor chains every time you want to look at a header field. There really are worse things than data copies. I find that the best thing to do is to identify the large objects in a program, such as data blocks, make sure those get allocated separately as described above so that they don't need to be copied, and not sweat too much about the other stuff.

This brings me to my last point about data copies: don't go overboard avoiding them. I've seen way too much code that avoids data copies by doing something even worse, like forcing a context switch or breaking up a large I/O request. Data copies are expensive, and when you're looking for places to avoid redundant operations they're one of the first things you should look at, but there is a point of diminishing returns. Combing through code and then making it twice as complicated just to get rid of that last few data copies is usually a waste of time that could be better spent in other ways.

Context Switches

Whereas everyone thinks it's obvious that data copies are bad, I'm often surprised by how many people totally ignore the effect of context switches on performance. In my experience, context switches are actually behind more total "meltdowns" at high load than data copies; the system starts spending more time going from one thread to another than it actually spends within any thread doing useful work. The amazing thing is that, at one level, it's totally obvious what causes excessive context switching. The #1 cause of context switches is having more active threads than you have processors. As the ratio of active threads to processors increases, the number of context switches also increases - linearly if you're lucky, but often exponentially. This very simple fact explains why multi-threaded designs that have one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so it's (usually) less than or equal to the number of processors. One popular variant of this approach is to use only one thread, ever; while such an approach does avoid context thrashing, and avoids the need for locking as well, it is also incapable of achieving more than one processor's worth of total throughput and thus remains beneath contempt unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.

The first thing that a "thread-frugal" program has to do is figure out how it's going to make one thread handle multiple connections at once. This usually implies a front end that uses select/poll, asynchronous I/O, signals or completion ports, with an event-driven structure behind that. Many "religious wars" have been fought, and continue to be fought, over which of the various front-end APIs is best. Dan Kegel's C10K paper is a good resource in this area. Personally, I think all flavors of select/poll and signals are ugly hacks, and therefore favor either AIO or completion ports, but it actually doesn't matter that much. They all - except maybe select() - work reasonably well, and don't really do much to address the matter of what happens past the very outermost layer of your program's front end.

The simplest conceptual model of a multi-threaded event-driven server has a queue at its center; requests are read by one or more "listener" threads and put on queues, from which one or more "worker" threads will remove and process them. Conceptually, this is a good model, but all too often people actually implement their code this way. Why is this wrong? Because the #2 cause of context switches is transferring work from one thread to another. Some people even compound the error by requiring that the response to a request be sent by the original thread - guaranteeing not one but two context switches per request. It's very important to use a "symmetric" approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.

Usually, it's not possible to know how many threads will be active even one instant into the future. After all, requests can come in on any connection at any moment, or "background" threads dedicated to various maintenance tasks could pick that moment to wake up. If you don't know how many threads are active, how can you limit how many are active? In my experience, one of the most effective approaches is also one of the simplest: use an old-fashioned counting semaphore which each thread must hold whenever it's doing "real work". If the thread limit has already been reached then each listen-mode thread might incur one extra context switch as it wakes up and then blocks on the semaphore, but once all listen-mode threads have blocked in this way they won't continue contending for resources until one of the existing threads "retires" so the system effect is negligible. More importantly, this method handles maintenance threads - which sleep most of the time and therefore don't count against the active thread count - more gracefully than most alternatives.
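A minimal sketch of that idea with a POSIX counting semaphore; wait_for_request(), process_request(), and request_t are hypothetical placeholders for the program's own front end:

#include <semaphore.h>

static sem_t worker_slots;  /* initialized once to the active-thread limit,
                               e.g. sem_init(&worker_slots, 0, nprocessors) */

void event_loop(void)
{
    for (;;) {
        request_t *req = wait_for_request(); /* listen mode                */
        sem_wait(&worker_slots);             /* may cost one extra context
                                                switch if the limit is hit */
        process_request(req);                /* "real work" (worker mode)  */
        sem_post(&worker_slots);             /* retire; wakes one waiter   */
    }
}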

Once the processing of requests has been broken up into two stages (listener and worker) with multiple threads to service the stages, it's natural to break up the processing even further into more than two stages. In its simplest form, processing a request thus becomes a matter of invoking stages successively in one direction, and then in the other (for replies). However, things can get more complicated; a stage might represent a "fork" between two processing paths which involve different stages, or it might generate a reply (e.g. a cached value) itself without invoking further stages. Therefore, each stage needs to be able to specify "what should happen next" for a request. There are three possibilities, represented by return values from the stage's dispatch function:

  • The request needs to be passed on to another stage (an ID or pointer in the return value).
  • The request has been completed (a special "request done" return value)
  • The request was blocked (a special "request blocked" return value). This is equivalent to the previous case, except that the request is not freed and will be continued later from another thread.

Note that, in this model, queuing of requests is done within stages, not between stages. This avoids the common silliness of constantly putting a request on a successor stage's queue, then immediately invoking that successor stage and dequeuing the request again; I call that lots of queue activity - and locking - for nothing.
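In C, the three possibilities above might be encoded like this; the request type, the stage table, and free_request() are invented for illustration:

typedef struct request request_t;         /* opaque request, hypothetical */
typedef int (*stage_fn)(request_t *req);  /* a stage's dispatch function  */

#define STAGE_DONE    (-1)  /* request completed; free it                 */
#define STAGE_BLOCKED (-2)  /* request parked; another thread resumes it  */

/* Drive one request through successive stages on the current thread;
 * a non-negative return value names the next stage to invoke. */
void run_request(stage_fn stages[], int first_stage, request_t *req)
{
    int next = first_stage;
    while (next >= 0)
        next = stages[next](req);
    if (next == STAGE_DONE)
        free_request(req);   /* hypothetical cleanup */
    /* on STAGE_BLOCKED the request is left alone until it's unblocked */
}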

If this idea of separating a complex task into multiple smaller communicating parts seems familiar, that's because it's actually very old. My approach has its roots in the Communicating Sequential Processes concept elucidated by C.A.R. Hoare in 1978, based in turn on ideas from Per Brinch Hansen and Matthew Conway going back to 1963 - before I was born! However, when Hoare coined the term CSP he meant "process" in the abstract mathematical sense, and a CSP process need bear no relation to the operating-system entities of the same name. In my opinion, the common approach of implementing CSP via thread-like coroutines within a single OS thread gives the user all of the headaches of concurrency with none of the scalability.

A contemporary example of the staged-execution idea evolved in a saner direction is Matt Welsh's SEDA. In fact, SEDA is such a good example of "server architecture done right" that it's worth commenting on some of its specific characteristics (especially where those differ from what I've outlined above).

  1. SEDA's "batching" tends to emphasize processing multiple requests through a stage at once, while my approach tends to emphasize processing a single request through multiple stages at once.
  2. SEDA's one significant flaw, in my opinion, is that it allocates a separate thread pool to each stage with only "background" reallocation of threads between stages in response to load. As a result, the #1 and #2 causes of context switches noted above are still very much present.
  3. In the context of an academic research project, implementing SEDA in Java might make sense. In the real world, though, I think the choice can be characterized as unfortunate.

Memory Allocation

Allocating and freeing memory is one of the most common operations in many applications. Accordingly, many clever tricks have been developed to make general-purpose memory allocators more efficient. However, no amount of cleverness can make up for the fact that the very generality of such allocators inevitably makes them far less efficient than the alternatives in many cases. I therefore have three suggestions for how to avoid the system memory allocator altogether.

Suggestion #1 is simple preallocation. We all know that static allocation is bad when it imposes artificial limits on program functionality, but there are many other forms of preallocation that can be quite beneficial. Usually the reason comes down to the fact that one trip through the system memory allocator is better than several, even when some memory is "wasted" in the process. Thus, if it's possible to assert that no more than N items could ever be in use at once, preallocation at program startup might be a valid choice. Even when that's not the case, preallocating everything that a request handler might need right at the beginning might be preferable to allocating each piece as it's needed; aside from the possibility of allocating multiple items contiguously in one trip through the system allocator, this often greatly simplifies error-recovery code. If memory is very tight then preallocation might not be an option, but in all but the most extreme circumstances it generally turns out to be a net win.

Suggestion #2 is to use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently-freed objects onto a list instead of actually freeing them, in the hope that if they're needed again soon they need merely be taken off the list instead of being allocated from system memory. As an additional benefit, transitions to/from a lookaside list can often be implemented to skip complex object initialization/finalization.

It's generally undesirable to have lookaside lists grow without bound, never actually freeing anything even when your program is idle. Therefore, it's usually necessary to have some sort of periodic "sweeper" task to free inactive objects, but it would also be undesirable if the sweeper introduced undue locking complexity or contention. A good compromise is therefore a system in which a lookaside list actually consists of separately locked "old" and "new" lists. Allocation is done preferentially from the new list, then from the old list, and from the system only as a last resort; objects are always freed onto the new list. The sweeper thread operates as follows:

  1. Lock both lists.
  2. Save the head for the old list.
  3. Make the (previously) new list into the old list by assigning list heads.
  4. Unlock.
  5. Free everything on the saved old list at leisure.

Objects in this sort of system are only actually freed when they have not been needed for at least one full sweeper interval, but always less than two. Most importantly, the sweeper does most of its work without holding any locks to contend with regular threads. In theory, the same approach can be generalized to more than two stages, but I have yet to find that useful.
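Here is the two-list scheme as a C sketch; the object type and the choice of pthread mutexes are illustrative assumptions:

#include <stdlib.h>
#include <pthread.h>

struct obj { struct obj *next; /* ...payload... */ };

static struct obj *new_list, *old_list;  /* the two lookaside lists */
static pthread_mutex_t new_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t old_lock = PTHREAD_MUTEX_INITIALIZER;

void sweeper_pass(void)
{
    struct obj *victims;

    pthread_mutex_lock(&new_lock);    /* 1. lock both lists              */
    pthread_mutex_lock(&old_lock);
    victims  = old_list;              /* 2. save the old list's head     */
    old_list = new_list;              /* 3. the new list becomes old     */
    new_list = NULL;
    pthread_mutex_unlock(&old_lock);  /* 4. unlock                       */
    pthread_mutex_unlock(&new_lock);

    while (victims) {                 /* 5. free at leisure, lock-free   */
        struct obj *next = victims->next;
        free(victims);
        victims = next;
    }
}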

One concern with using lookaside lists is that the list pointers might increase object size. In my experience, most of the objects that I'd use lookaside lists for already contain list pointers anyway, so it's kind of a moot point. Even if the pointers were only needed for the lookaside lists, though, the savings in terms of avoided trips through the system memory allocator (and object initialization) would more than make up for the extra memory.

Suggestion #3 actually has to do with locking, which we haven't discussed yet, but I'll toss it in anyway. Lock contention is often the biggest cost in allocating memory, even when lookaside lists are in use. One solution is to maintain multiple private lookaside lists, such that there's absolutely no possibility of contention for any one list. For example, you could have a separate lookaside list for each thread. One list per processor can be even better, due to cache-warmth considerations, but only works if threads cannot be preempted. The private lookaside lists can even be combined with a shared list if necessary, to create a system with extremely low allocation overhead.

Lock Contention

Efficient locking schemes are notoriously hard to design, because of what I call Scylla and Charybdis after the monsters in the Odyssey. Scylla is locking that's too simplistic and/or coarse-grained, serializing activities that can or should proceed in parallel and thus sacrificing performance and scalability; Charybdis is overly complex or fine-grained locking, with space for locks and time for lock operations again sapping performance. Near Scylla are shoals representing deadlock and livelock conditions; near Charybdis are shoals representing race conditions. In between, there's a narrow channel that represents locking which is both efficient and correct...or is there? Since locking tends to be deeply tied to program logic, it's often impossible to design a good locking scheme without fundamentally changing how the program works. This is why people hate locking, and try to rationalize their use of non-scalable single-threaded approaches.

Almost every locking scheme starts off as "one big lock around everything" and a vague hope that performance won't suck. When that hope is dashed, and it almost always is, the big lock is broken up into smaller ones and the prayer is repeated, and then the whole process is repeated, presumably until performance is adequate. Often, though, each iteration increases complexity and locking overhead by 20-50% in return for a 5-10% decrease in lock contention. With luck, the net result is still a modest increase in performance, but actual decreases are not uncommon. The designer is left scratching his head (I use "his" because I'm a guy myself; get over it). "I made the locks finer grained like all the textbooks said I should," he thinks, "so why did performance get worse?"

In my opinion, things got worse because the aforementioned approach is fundamentally misguided. Imagine the "solution space" as a mountain range, with high points representing good solutions and low points representing bad ones. The problem is that the "one big lock" starting point is almost always separated from the higher peaks by all manner of valleys, saddles, lesser peaks and dead ends. It's a classic hill-climbing problem; trying to get from such a starting point to the higher peaks only by taking small steps and never going downhill almost never works. What's needed is a fundamentally different way of approaching the peaks.

The first thing you have to do is form a mental map of your program's locking. This map has two axes:

  • The vertical axis represents code. If you're using a staged architecture with non-branching stages, you probably already have a diagram showing these divisions, like the ones everybody uses for OSI-model network protocol stacks.
  • The horizontal axis represents data. In every stage, each request should be assigned to a data set with its own resources separate from any other set.

You now have a grid, where each cell represents a particular data set in a particular processing stage. What's most important is the following rule: two requests should not be in contention unless they are in the same data set and the same processing stage. If you can manage that, you've already won half the battle.

Once you've defined the grid, every type of locking your program does can be plotted, and your next goal is to ensure that the resulting dots are as evenly distributed along both axes as possible. Unfortunately, this part is very application-specific. You have to think like a diamond-cutter, using your knowledge of what the program does to find the natural "cleavage lines" between stages and data sets. Sometimes they're obvious to start with. Sometimes they're harder to find, but seem more obvious in retrospect. Dividing code into stages is a complicated matter of program design, so there's not much I can offer there, but here are some suggestions for how to define data sets:

  • If you have some sort of a block number or hash or transaction ID associated with requests, you can rarely do better than to divide that value by the number of data sets.
  • Sometimes, it's better to assign requests to data sets dynamically, based on which data set has the most resources available rather than some intrinsic property of the request. Think of it like multiple integer units in a modern CPU; those guys know a thing or two about making discrete requests flow through a system.
  • It's often helpful to make sure that the data-set assignment is different for each stage, so that requests which would contend at one stage are guaranteed not to do so at another stage.

If you've divided your "locking space" both vertically and horizontally, and made sure that lock activity is spread evenly across the resulting cells, you can be pretty sure that your locking is in pretty good shape. There's one more step, though. Do you remember the "small steps" approach I derided a few paragraphs ago? It still has its place, because now you're at a good starting point instead of a terrible one. In metaphorical terms you're probably well up the slope on one of the mountain range's highest peaks, but you're probably not at the top of one. Now is the time to collect contention statistics and see what you need to do to improve, splitting stages and data sets in different ways and then collecting more statistics until you're satisfied. If you do all that, you're sure to have a fine view from the mountaintop.

Other Stuff

As promised, I've covered the four biggest performance problems in server design. There are still some important issues that any particular server will need to address, though. Mostly, these come down to knowing your platform/environment:

  • How does your storage subsystem perform with larger vs. smaller requests? With sequential vs. random? How well do read-ahead and write-behind work?
  • How efficient is the network protocol you're using? Are there parameters or flags you can set to make it perform better? Are there facilities like TCP_CORK, MSG_PUSH, or the Nagle-toggling trick that you can use to avoid tiny messages?
  • Does your system support scatter/gather I/O (e.g. readv/writev)? Using these can improve performance and also take much of the pain out of using buffer chains; see the sketch after this list.
  • What's your page size? What's your cache-line size? Is it worth it to align stuff on these boundaries? How expensive are system calls or context switches, relative to other things?
  • Are your reader/writer lock primitives subject to starvation? Of whom? Do your events have "thundering herd" problems? Does your sleep/wakeup have the nasty (but very common) behavior that when X wakes Y a context switch to Y happens immediately even if X still has things to do?
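For instance, writev() can send a protocol header and a payload held in separate buffers with one system call and no copy into a staging buffer (a minimal sketch; callers must still handle short writes):

#include <sys/uio.h>

ssize_t send_msg(int fd, const void *hdr, size_t hlen,
                 const void *body, size_t blen)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr;  iov[0].iov_len = hlen;
    iov[1].iov_base = (void *)body; iov[1].iov_len = blen;
    return writev(fd, iov, 2);  /* gathers both buffers into one write */
}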

I'm sure I could think of many more questions in this vein. I'm sure you could too. In any particular situation it might not be worthwhile to do anything about any one of these issues, but it's usually worth at least thinking about them. If you don't know the answers - many of which you will not find in the system documentation - find out. Write a test program or micro-benchmark to find the answers empirically; writing such code is a useful skill in and of itself anyway. If you're writing code to run on multiple platforms, many of these questions correlate with points where you should probably be abstracting functionality into per-platform libraries so you can realize a performance gain on that one platform that supports a particular feature.

The "know the answers" theory applies to your own code, too. Figure out what the important high-level operations in your code are, and time them under different conditions. This is not quite the same as traditional profiling; it's about measuring design elements, not actual implementations. Low-level optimization is generally the last resort of someone who screwed up the design.

  1. The 10K concurrent connections problem

This post is a translation of, and notes on, http://www.kegel.com/c10k.html.

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web has the biggest turf these days.


Computers are big now, too. For roughly $1,200 you can buy a 1GHz machine with 2GB of RAM and a gigabit Ethernet card. Let's see: at 20K clients, that's 50KHz of CPU, 100K bytes of memory, and 50Kbits/sec of bandwidth per client. It shouldn't take any more horsepower than that to take 4K bytes from disk and send them to the network once a second for each of 20K clients. (That works out to 8 cents per client, by the way; those $100-per-client server licensing fees are starting to look a little expensive.) So hardware is no longer the bottleneck. (These foreigners are meticulous; they really do work it all out!)


In 1999 one of the busiest FTP sites, cdrom.com, actually handled 10K connections concurrently through a gigabit Ethernet pipe. By 2001 the same speed was being offered by several ISPs, which expect it to become increasingly popular with large business customers.


The thin-client model of computing is formally back in style: this time the server sits on the network, serving thousands of clients.


With that in mind, here are some notes on how to configure operating systems and write code to support thousands of clients. The discussion centers on Unix-like operating systems, my personal area of interest, though Windows is also covered a bit.



Contents

  • The C10K problem
  • Related sites
  • Book to read first
  • I/O frameworks
  • I/O strategies
    1. Serve many clients with each thread, using nonblocking I/O and level-triggered readiness notification
      • The traditional select()
      • The traditional poll()
      • /dev/poll (Solaris 2.7+)
      • kqueue (FreeBSD, NetBSD)
    2. Serve many clients with each thread, using nonblocking I/O and readiness change notification
      • epoll (Linux 2.6+)
      • Polyakov's kevent (Linux 2.6+)
      • Drepper's New Network Interface (proposal for Linux 2.6+)
      • Realtime Signals (Linux 2.4+)
      • Signal-per-fd
      • kqueue (FreeBSD, NetBSD)
    3. Serve many clients with each server thread, using asynchronous I/O and completion notification
    4. Serve one client with each server thread
      • LinuxThreads (Linux 2.0+)
      • NGPT (Linux 2.4+)
      • NPTL (Linux 2.6, Red Hat 9)
      • FreeBSD threading support
      • NetBSD threading support
      • Solaris threading support
      • Java threading support in JDK 1.3.x and earlier
      • Note: 1:1 threading vs. M:N threading
    5. Build the server code into the kernel
  • Comments
  • Limits on open filehandles
  • Limits on threads
  • Java issues
  • Other tips
    • Zero-copy
    • The sendfile() system call can implement zero-copy networking
    • Avoid small frames by using writev (or TCP_CORK)
    • Some programs can benefit from using non-Posix threads
    • Caching your own data can sometimes be a win
  • Other limits
  • Kernel issues
  • Measuring server performance
  • Examples
    • Interesting select()-based servers
    • Interesting /dev/poll-based servers
    • Interesting kqueue()-based servers
    • Interesting realtime-signal-based servers
    • Interesting thread-based servers
    • Interesting in-kernel servers
  • Other interesting links


Related sites

In October 2003, Felix von Leitner put together an excellent page and presentation about network scalability, complete with benchmarks comparing various networking system calls and operating systems. One of his observations is that the Linux 2.6 kernel really does beat 2.4, but there are many, many good graphs that will give OS developers food for thought for some time. (See also the Slashdot comments; it remains to be seen whether anyone will follow up with benchmarks improving on Felix's results.)

Book to read first

If you haven't read it already, go out and get a copy of Unix Network Programming: Networking APIs: Sockets and XTI (Volume 1) by W. Richard Stevens. It describes many of the I/O strategies and the pitfalls of writing high-performance servers. It even talks about the "thundering herd" problem. When you are done with it, go read Jeff Darcy's notes on high-performance server design.

Another book that may be more helpful for those who are using rather than writing a web server: Building Scalable Web Sites.

I/O frameworks

Prepackaged libraries are available that abstract some of the techniques described here, insulating your code from the operating system and making it more portable.

  • ACE, a heavyweight C++ I/O framework, contains object-oriented implementations of some of these I/O strategies and many other useful things. In particular, its Reactor is an OO way of doing nonblocking I/O, and its Proactor is an OO way of doing asynchronous I/O.
  • ASIO is a C++ I/O framework that is becoming part of the Boost library. It's like ACE updated for the STL era.
  • libevent is a lightweight C I/O framework by Niels Provos. It supports kqueue and select, and soon will support poll and epoll. It's level-triggered only, which the author considers both good and bad. Niels has a nice graph of the time to handle one event as a function of the number of connections; it shows kqueue and sys_epoll as the clear winners.
  • The author's own lightweight frameworks (sadly, not up to date):
    • Poller is a lightweight C++ I/O library implementing a level-triggered readiness API on top of whichever underlying readiness API you want (poll, select, /dev/poll, kqueue, or sigio). It's useful for benchmarks comparing the performance of the various APIs, and the Poller subclass examples linked from that document illustrate how each of the readiness APIs is used.
    • rn is a lightweight C I/O framework, the author's second try after Poller. It's LGPL-licensed (so it's easier to use in commercial applications).
  • Matt Welsh wrote a paper in April 2000 on how to balance worker threads against event-driven techniques when building scalable servers. The paper describes part of his Sandstorm I/O framework.
  • Cory Nelson's Scale! library - an asynchronous socket, file, and pipe I/O library for Windows

I/O strategies

Installing and configuring the static code analysis tool PC-lint -- step by step [repost]

http://www.yuanma.org/data/2006/0529/article_512.htm

PC-lint is a static analysis tool for C/C++ code; you can think of it as an extra-strict compiler. It catches not only ordinary syntax errors but also latent defects that are syntactically legal yet easy to overlook. C's flexibility improves code efficiency but also invites sloppiness, and because C compilers do not enforce strict type checking, further hazards hide in the code. PC-lint recognizes and reports programming traps and format defects in C. It performs a global analysis of the program: it can identify array subscripts that are not properly validated, report uninitialized variables, and warn about null-pointer use, redundant code, and so on. Debugging is a major source of project cost and delay; PC-lint helps you find coding errors before dynamic testing, when removing them is much cheaper. Running PC-lint before code review and unit testing surfaces hidden defects early, improves code quality, and saves testing time, and its coding-rule checks help standardize developers' habits. Since PC-lint is unfamiliar to many programmers, and many who install it do not know how to configure or use it, below I describe the installation, configuration, and use of PC-lint in detail, based on my own experience: specifically, how to integrate it into VC++ 6.0 and Source Insight.
(1) Among Windows C/C++ development tools VC6 is the most common, so here first are the steps for integrating PC-lint into the VC 6.0 environment.
First, of course, download the software. The genuine article costs over $200, which I can't afford, so I had to look for a free copy online. An 8.0 build of PC-lint can be downloaded from http://www.61ic.com/down/othe/pclint.rar.
1. Unpack pclint.rar to c:\, so the lint files land in c:\pclint (the installation directory). 2. Copy the three files lib-w32.lnt, env-vc6.lnt, and co-msc60.lnt from c:\pclint\lnt into c:\pclint, then create two files, std.lnt and options.lnt, in the installation directory. std.lnt contains:
// contents of std.lnt
c:\pclint\co-msc60.lnt
c:\pclint\lib-w32.lnt
c:\pclint\options.lnt
-si4 -sp4
-i"D:\Program Files;D:\Program Files\Microsoft Visual Studio\VC98\Include"
// end
The paths after -i are the VC installation path and the VC include path; adjust them to your own setup.
options.lnt may be left empty; it is for customizations to be added later as needed.
That completes the preparation. The next step is integrating pclint into VC6, first configuring lint to check a single C or C++ file.
1. Open VC6, Tools--->Customize-->Tools, and create an entry named pclint, filling in:
Command: C:\pclint\lint-nt.exe
Arguments: -u c:\pclint\std.lnt c:\pclint\env-vc6.lnt "$(FilePath)"
Tick "Use Output Window" and close. A pclint item should now appear under the Tools menu of your VC window; use it to run lint and statically check your C/C++ code.
Now we can test pclint with a small program:
// test1.cpp
#include <string.h>  /* for memset; the header name was lost in the original post */
class X
{
int *p;
public:
X()
{ p = new int[20]; }
void init()
{ memset( p, 20, 'a' ); }
~X()
{ delete p; }
};
Compile this file and see how many warnings your compiler gives you, then run lint and compare. On my machine VC produced 0 errors, 0 warnings, while lint produced the following 8 messages, some of them quite useful hints; I won't analyze them one by one.
test.cpp(12): error 783: (Info -- Line does not end with new-line)
test.cpp(7): error 1732: (Info -- new in constructor for class 'X' which has no assignment operator)
test.cpp(7): error 1733: (Info -- new in constructor for class 'X' which has no copy constructor)
{ memset( p, 20, 'a' ); }
test.cpp(9): error 669: (Warning -- Possible data overrun for function 'memset(void *, int, unsigned int)', argument 3 (size=97) exceeds argument 1 (size=80) [Reference: test.cpp: lines 7, 9])
test.cpp(7): error 831: (Info -- Reference cited in prior message)
test.cpp(9): error 831: (Info -- Reference cited in prior message)
{ delete p; }
test.cpp(11): error 424: (Warning -- Inappropriate deallocation (delete) for 'new[]' data)

--- Wrap-up for Module: test.cpp

test.cpp(2): error 753: (Info -- local class 'X' (line 2, file test.cpp) not referenced)
Tool returned code: 8
2. A VC project usually contains several C or C++ files, and sometimes you want to run lint over the whole set at once; this can be done by configuring a pclint_project entry.
The method is basically the same as in step 1 above, except that here we use the Unix find command (and friends) to locate the C and C++ files in the current directory and hand them to lint, so first download UnxUtils.zip from http://www.weihenstephan.de/~syring/win32/UnxUtils.zip. Then proceed as follows:
(i) Unpack UnxUtils.zip to c:\unix; C:\unix\usr\local\wbin contains many Unix commands that will be used in a moment.
(ii) Open VC6, Tools--->Customize-->Tools, and create an entry named pclint_project; only the command and arguments differ:
Command: C:\unix\usr\local\wbin\find.exe
Arguments: $(FileDir) -name *.c -o -name *.cpp | C:\unix\usr\local\wbin\xargs lint-nt -i"c:\unix\usr\local" -u c:\pclint\std.lnt c:\pclint\env-vc6.lnt
(iii) Tick "Use Output Window" and close. A pclint_project item now appears under the VC Tools menu; use it from now on to run the lint check over a whole VC project.
(2) Integrating pclint into Source Insight.
Many people on Windows also like to edit C/C++ programs in Source Insight; integrating pclint into it effectively gives Source Insight a C/C++ compiler, and a stricter one that finds problems compilers miss, which can greatly reduce the bugs lurking in a program. With that, I expect even more people will like the Source Insight tool.
Here, briefly, is how to integrate pclint into Source Insight.
With the VC integration experience above, what follows is fairly easy. (a) Open Source Insight and choose Options-->Custom Commands-->Add, entering pclint (any name will do).
(b) In Run, enter: c:\pclint\lint-nt -u c:\pclint\std.lnt c:\pclint\env-vc6.lnt %f
(c) Leave Dir empty and tick the four boxes Iconic Window, Capture Output, Parse Links in Output, and File, then Line.
(d) Then click Menu on the right: Menu-->View-->, Insert on the right, OK.
(e) A pclint item now appears under Source Insight's View menu; use it to statically check a single C/C++ file.
A similar configuration sets up the lint check for a whole Source Insight project.
(a) Open Source Insight and choose Options-->Custom Commands-->Add, entering pclint_project (any name will do).
(b) In Run, enter: C:\unix\usr\local\wbin\find.exe %d -name *.c -o -name *.cpp | C:\unix\usr\local\wbin\xargs lint-nt -i"C:\unix\usr\local" -u c:\pclint\std.lnt c:\pclint\env-vc6.lnt
(c) Leave Dir empty and tick the four boxes Iconic Window, Capture Output, Parse Links in Output, and File, then Line.
(d) Then click Menu on the right: Menu-->View-->, Insert on the right, OK.
(e) A pclint_project item now appears under Source Insight's View menu; use it to statically check the C/C++ files of a whole project.
This article has given a fairly detailed account, based on my own installation and usage experience, of how to integrate pclint into the VC and Source Insight environments. I hope it is of some help to those who have not used pclint before; corrections are welcome!

Friday, October 20, 2006

Some articles and tools on regular expressions

http://www.cnblogs.com/Laser_Lu/archive/2005/04/21/142605.aspx

http://aspxboy.com/private/508/default.aspx

http://www.codeproject.com/dotnet/regextutorial.asp

http://www.codeproject.com/dotnet/expresso.asp

http://www.radsoftware.com.au/regexdesigner/

Tuesday, October 17, 2006

Wikipedia

Wikipedia seemed to have been unblocked this afternoon. Cheers! http://www.wikipedia.org/
Later I refreshed again and found it unreachable; still blocked.

An analysis of P2P profit models

An article on network security I was just browsing mentioned the new problems network security has to face. I didn't dig into the author's views; they were the usual ones: how the popular P2P technology creates security risks, how it brings new challenges to network security, and so on.

What is certain is that P2P involves large numbers of TCP/UDP connections and large numbers of open ports, which runs against the traditional security principle of opening as few ports as possible. P2P products try to break through firewall control and make heavy use of well-known ports such as 80, 443, 22, 21, and 25. In this respect, one side advertises intelligent firewall traversal while the other advertises professional P2P blocking; the two seem utterly irreconcilable.

As I see it, all this comes from P2P products lacking a standard, and the P2P industry being rather short on self-discipline. The benefit of a P2P service today comes from its popularity, much like traditional websites: with the P2P profit model unclear, only a large user base offers any chance of making money. P2P technology forcibly consumes the bandwidth of individuals, organizations, and service providers, and bandwidth is money; seen that way, P2P has a whiff of robbery about it. Then again, if you want the quality service P2P offers, you have to pay some price. And so the telecom and netcom carriers are alarmed, and so are the companies whose bandwidth P2P eats up!

Solving this requires aligning the interests of many parties, which is easier said than done. Content providers, service operators, infrastructure (mainly the network), users, and the other links in the chain have not yet aligned their interests and take no responsibility for one another; mostly the service operators are performing a one-man show. It takes no thought at all to see the problem once the limited venture money runs out: where is the way out? The misery is easy to imagine.

Is the P2P industry finished, then? Where did the momentum go, when a little technical innovation once set the whole Internet surging? After the bubble, when everyone calms down, it will be time for someone to step forward. Who will it be; who will pick up the pieces? Once the industry quiets down there are still plenty of players who could take the role, and the job is beachcombing: taking from the shells the pearl that grew inside the one shell that endured the suffering. Once the value shows, the little shells will have hope again. Traditional media (chiefly TV stations), network operators (telecom/netcom), and portal sites could all play this beachcomber and become the mainstream of the P2P operating model. The non-mainstream case would be providing some particular service as a value-added module, but that should be the smaller part.

Google Video currently refuses to provide service in my region, so I have no way to judge the current crop of P2P new-media services. When the various forces balance out, P2P's spring will come. To those still fighting in the P2P industry: hold on to your ideals and keep that innovative energy alive!

By this point I have wandered seriously off topic; I meant to discuss P2P and network security and ended up venting instead.

Enough said. I'll be watching!

Monday, October 16, 2006

Apache 2.0 performance tuning: choosing and configuring the MPM

Reposted from: http://docs.chinalinuxpub.com/read.php?wid=719

Apache 2.0's most attractive improvement is performance. On Unix systems that support POSIX threads, Apache can run via different MPMs in a mixed multi-process/multi-threaded mode, improving the scalability of some configurations. Compared with Apache 1.3, version 2.0 received a great deal of optimization for throughput and scalability, and most of the improvements take effect by default. There are, however, many compile-time and run-time choices in 2.0 that can markedly improve performance. This article will not describe directives that trade functionality for speed, such as HostnameLookups; it only explains the feature at the core of 2.0's performance: the basic working principles and configuration directives of the MPMs (Multi-Processing Modules).

It is no exaggeration to say that the introduction of MPMs is the most important change in Apache 2.0. As is well known, Apache is built on a modular design, and Apache 2.0 extends that modularity to the web server's most basic functions. The server loads one multi-processing module, responsible for binding the local network ports, accepting requests, and dispatching child processes to handle them. Extending the modular design has two important benefits:

◆ Apache can support many operating systems more cleanly and efficiently;

◆ the server can be customized for a site's special needs.

At the user level an MPM looks much like any other Apache module. The main difference is that at any moment only one MPM can be loaded into the server.

How to specify the MPM

Below, with Red Hat Linux 9 as the platform, is how to specify the MPM in Apache 2.0 (using Apache 2.0.45). First unpack the source tarball httpd-2.0.45.tar.gz, producing the directory httpd-2.0.45. (Apache 1.3 source tarballs are named apache_1.3.NN.tar.gz, while 2.0 ones are named httpd-2.0.NN.tar.gz, NN being the minor version.)

Enter the httpd-2.0.45 directory and run:

 $ ./configure --help|grep mpm



which prints:

--with-mpm=MPM
Choose the process model for Apache to use.
MPM={beos|worker|prefork|mpmt_os2| perchild|leader|threadpool}



This chooses the process model, i.e. which MPM module, to use. beos and mpmt_os2 are the default MPMs on BeOS and OS/2. perchild is mainly designed to run different child processes as different users and groups, which is especially useful for running multiple virtual hosts that need CGI, and does it better than the SuExec mechanism of 1.3. leader and threadpool are both variants of worker and still experimental; in some situations they will not work as expected, so Apache officially does not recommend them. We will therefore mainly describe prefork and worker, the two production MPMs that matter most for performance. (For detailed descriptions of the other MPMs, see the official Apache documentation: http://httpd.apache.org/docs-2.0/mod/)

How prefork works and how to configure it

If you do not specify an MPM explicitly with "--with-mpm", prefork is the default MPM on Unix platforms. The pre-forked child process model it uses is the same model Apache 1.3 used. prefork itself does not use threads; 2.0 keeps it partly for compatibility with 1.3, and partly because prefork handles each request in a separate, mutually independent child process, which also makes it one of the most stable MPMs.

With prefork, after make and make install, run "httpd -l" to confirm the MPM in use; you should see prefork.c (if you see worker.c, the worker MPM is in use, and so on). The default generated httpd.conf configuration file contains a section like this:


<IfModule prefork.c>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 150
MaxRequestsPerChild 0
</IfModule>




prefork works like this: after creating the initial "StartServers" child processes, the control process creates one more process to satisfy the MinSpareServers setting, waits a second, creates two more, waits another second, creates four more... increasing the number created exponentially, up to 32 per second, until the MinSpareServers value is satisfied. This is where the name "prefork" comes from. The server need not fork new processes at the moment requests arrive, which reduces system overhead and increases performance.

MaxSpareServers sets the maximum number of idle processes; if there are more idle processes than this, Apache automatically kills the excess. Do not set it too large, but if it is set smaller than MinSpareServers, Apache automatically adjusts it to MinSpareServers+1. If the site is heavily loaded, consider raising MinSpareServers and MaxSpareServers together.

MaxRequestsPerChild sets how many requests each child process may handle. A child is destroyed automatically after handling "MaxRequestsPerChild" requests. 0 means unlimited, i.e. the child is never destroyed. Although the default of 0 lets each child handle more requests, a non-zero value has two important benefits:

◆ it prevents accidental memory leaks from accumulating;

◆ when server load falls, the number of children decreases automatically.

Therefore this value can be tuned to the server's load; I consider about 10000 appropriate.

MaxClients is the most important of these directives: it sets how many requests Apache can handle simultaneously and is the parameter with the largest impact on Apache's performance. The default of 150 is far from enough: once the number of requests in flight reaches it (confirm with ps -ef|grep http|wc -l), subsequent requests queue until some request finishes, which is the main reason HTTP access is slow while plenty of system resources remain free. Administrators should tune it dynamically to the hardware and load. Although in theory the larger it is the more requests can be handled, Apache's default limit is 256; set MaxClients above 256 and Apache will not start. In practice 256 is not enough even for moderately loaded sites. In Apache 1.3 this was a hard limit: to raise it you had to edit src/include/httpd.h in the source tree before running "configure", search for 256 to find the line "#define HARD_SERVER_LIMIT 256", change 256 to the value you wanted (e.g. 4000), and recompile Apache. Apache 2.0 adds the ServerLimit directive, so MaxClients can be raised without recompiling. Here is my prefork section:


<IfModule prefork.c>
StartServers 10
MinSpareServers 10
MaxSpareServers 15
ServerLimit 2000
MaxClients 1000
MaxRequestsPerChild 10000
</IfModule>




In the configuration above, ServerLimit's maximum is 20000, enough for most sites. If you must raise it further, change these two lines in server/mpm/prefork/prefork.c in the source tree accordingly:

#define DEFAULT_SERVER_LIMIT 256
#define MAX_SERVER_LIMIT 20000



How worker works and how to configure it

Compared with prefork, worker is the brand-new 2.0 MPM supporting a mixed multi-threaded and multi-process model. Because it handles requests with threads, it can serve a relatively huge number of requests while consuming fewer system resources than a process-based server. Yet worker also uses multiple processes, each spawning multiple threads, to obtain the stability of a process-based server. This MPM's way of working is the direction Apache 2.0 is developing in.

After configure --with-mpm=worker, run make and make install. The default generated httpd.conf contains this section:


<IfModule worker.c>
StartServers 2
MaxClients 150
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>




worker works like this: the master control process creates "StartServers" child processes; each child contains a fixed number, ThreadsPerChild, of threads, and the threads handle requests independently. Again, so that threads need not be created when requests arrive, MinSpareThreads and MaxSpareThreads set the minimum and maximum number of idle threads, while MaxClients sets the total number of threads in all children. If the threads in the existing children cannot satisfy the load, the control process forks new children.

The maximum defaults of MinSpareThreads and MaxSpareThreads are 75 and 250 respectively. These two parameters have little impact on Apache's performance and can be adjusted as circumstances require.

ThreadsPerChild is the directive in the worker MPM most closely related to performance. Its maximum default is 64; if the load is heavy, 64 is not enough, in which case the ThreadLimit directive must be used explicitly; its maximum default is 20000. The two values come from the following two lines in server/mpm/worker/worker.c:

#define DEFAULT_THREAD_LIMIT 64
#define MAX_THREAD_LIMIT 20000



These two lines bound ThreadsPerChild and ThreadLimit. It is best to change the 64 to the desired value before running configure. Note: do not set either value beyond what the system can handle; an Apache that fails to start can leave the system quite unstable.

The total number of requests worker can serve simultaneously is the total number of children multiplied by the ThreadsPerChild value, and it should be greater than or equal to MaxClients. If the load is heavy and the existing children cannot cope, the control process forks new children. The default maximum number of children is 16; to raise it, ServerLimit must also be declared explicitly (its maximum is 20000). These two values come from the following two lines in server/mpm/worker/worker.c:

#define DEFAULT_SERVER_LIMIT 16
#define MAX_SERVER_LIMIT 20000



Note that if ServerLimit is declared explicitly, its product with ThreadsPerChild must be greater than or equal to MaxClients, and MaxClients must be an exact multiple of ThreadsPerChild; otherwise Apache automatically adjusts them to some corresponding (possibly unexpected) value. In the section below, for instance, 25 x 100 = 2500 >= 2000, and 2000 is an exact multiple of 100. Here is my worker section:


<IfModule worker.c>
StartServers 3
MaxClients 2000
ServerLimit 25
MinSpareThreads 50
MaxSpareThreads 200
ThreadLimit 200
ThreadsPerChild 100
MaxRequestsPerChild 0
</IfModule>




The discussion above explains how the two important Apache 2.0 MPMs, prefork and worker, operate, and how to configure Apache's core parameters to actual conditions for maximum performance and stability.



More details can be found at http://httpd.apache.org/docs-2.0/

C10K: the 10K concurrent connections problem!

This post is a translation of, and notes on, http://www.kegel.com/c10k.html.

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web has the biggest turf these days.

Computers are more powerful now, too. For roughly $1,200 you can buy a machine with a 1GHz processor, 2GB of RAM, and a gigabit Ethernet card. Let's see: at 20K clients, that's 50KHz of CPU, 100K bytes of memory, and 50Kbits/sec of bandwidth per client. (That works out to 6 cents per client, by the way; those $100-per-client server licensing fees are starting to look a little expensive.) So hardware is no longer the bottleneck. (These foreigners are meticulous; they really do work it all out!)

In 1999 one of the busiest FTP sites, cdrom.com, actually handled 10K connections concurrently through a gigabit Ethernet pipe. By 2001 the same speed was being offered by several ISPs, which expect it to become increasingly popular with large business customers.

The thin-client model of computing is back in style: this time the server sits on the network, serving thousands of clients.

This article discusses some of the issues in configuring operating systems and writing code to support thousands of clients, focusing on Unix-like operating systems, though Windows is also covered!

Sunday, October 15, 2006

WebOS, YouOS

WebOS, YouOS... the Internet has grown so big that every kind of bird turns up in it?


Saturday, October 14, 2006

Feeling the youth of 血色浪漫!

Although my writing is clumsy, after finishing 《血色浪漫》 I still felt the urge to put down some impressions. True to its name: among men who grew up in the 1970s, hardly any failed to earn some distinction through fighting. I remember my wife once washing my hair, noticing a patch on my head with no hair, and asking why; I answered that as a boy a thrown rock punched a hole there during a fight. She laughed at me: you're so clumsy, you surely never won a fight. That's how it was: in that era even well-behaved star pupils fought until heads broke and blood flowed.

To tell the truth, I was born in the 1970s and grew up in the 1980s, so I have almost no memories of the seventies, but I remember a 1980 wall print at home, probably of that Yimeng-mountain Red Army ballet, though many families hung prints of generals and great men. Many details in 血色浪漫 must be faithful reflections of that era; at the least, my parents were always bringing up the educated youth sent down to the countryside in those years.

What moved me most were the xintianyou folk songs in it; only people from that land can resonate with them at such depth. The stories of Shanxi people going west of the pass (zou xikou) and venturing east of the pass have become history, things one must mention whenever the Shanxi merchants come up. But the xintianyou songs that traveled with those stories remain the brand of having grown up on that land. At our high-school graduation party, our biology teacher [note: a pretty young woman, a Shanxi University graduate, and a local like us] sang "Zou Xikou": "Brother, you go west of the pass, and I, little sister, truly cannot make you stay". She sang it beautifully, and how it moved me! Come to think of it, I, who left that land by studying and passing exams, am no different from the forebears who went west of the pass to make a living. I was one of that teacher's favorite students; she taught me biology for only a year, but her impression of me was surely deep. Around the time she was teaching chapter 5 I had already finished the whole book, and when she came down the aisle I was caught reading a different subject; annoyed, she asked me the relationship between natural selection and heritable variation (Darwin's origin of species, chapter 7 of the high-school biology text, I think), and how smoothly I answered! But since I read widely and drilled far fewer exercises than others, my college entrance exam score ten years ago was disappointing, and I barely scraped into Beijing.

The sent-down youth certainly met many things they never encountered in the city. Some childhood memories run deep, some sweet, some bitter; city kids may never even have heard of such things, so let me pour them out here; call it an interlude, a green childhood. Never mind how Zhong Yuemin felt when that boy brought him roast field mice: as children we really did roast cicadas and sparrows, and corn, potatoes, and sweet potatoes too, but that story can wait. When the grown-ups went to work in the fields, they would set the children down nearby to play by themselves (nurseries and kindergartens were not concepts I knew as a child). A little friend and I were playing at the edge of the field, wandered into the scrub, and ran into a hornets' nest; we say "poking the hornets' nest" for stirring up enormous trouble, and you cannot imagine how miserably the two of us cried. My loving grandfather burned out that nest with a torch, avenging me. Another time, on the path by the field, a bicycle hit me square on: a bleeding hole between my eyes, four stitches at the hospital, and once bandaged I was the very image of the clown in an opera. When I was about ten, the friend who had been stung with me died of the plague (that terrifying infectious disease), and after that there could be no more memories of that little companion. (The childhood fights and war games can wait for another time.)

Memories of cloth and grain coupons: the cloth-coupon memory comes from my mother's telling; the grain-coupon memory is from junior high. My grandparents and we lived in two courtyards separated by a lane. Mother once sent me to take a five-chi cloth coupon to grandmother; on the way between the two courtyards I ran off to play and lost it, then came home and lied to my mother that I had delivered it. The outcome, naturally, was a beating. In junior high, because the nationwide grain coupons were about to expire, for a while we kept going to the grain store to buy steamed buns and flower rolls.

Visitor statistics from 2006-10-8 to 2006-10-13

Tuesday, October 10, 2006

mod_lua performance improved: now 115% of mod_python's speed

After the last performance comparison I couldn't resist improving the mod_lua code; below are the latest test results.

mod_python
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)


Server Software: Apache/2.0.59
Server Hostname: localhost
Server Port: 80

Document Path: /tmp/hello.py
Document Length: 12 bytes

Concurrency Level: 1
Time taken for tests: 1.78125 seconds
Complete requests: 500
Failed requests: 0
Write errors: 0
Total transferred: 102500 bytes
HTML transferred: 6000 bytes
Requests per second: 463.77 [#/sec] (mean)
Time per request: 2.156 [ms] (mean)
Time per request: 2.156 [ms] (mean, across all concurrent requests)
Transfer rate: 92.75 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 2.3 0 15
Processing: 0 1 4.7 0 15
Waiting: 0 1 4.5 0 15
Total: 0 1 5.1 0 15

Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 15
95% 15
98% 15
99% 15
100% 15 (longest request)
mod_lua
This is ApacheBench, Version 2.0.41-dev <$Revision: 1.121.2.12 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)


Server Software: Apache/2.0.59
Server Hostname: localhost
Server Port: 80

Document Path: /tmp/hello.lua
Document Length: 12 bytes

Concurrency Level: 1
Time taken for tests: 0.937500 seconds
Complete requests: 500
Failed requests: 0
Write errors: 0
Total transferred: 102000 bytes
HTML transferred: 6000 bytes
Requests per second: 533.33 [#/sec] (mean)
Time per request: 1.875 [ms] (mean)
Time per request: 1.875 [ms] (mean, across all concurrent requests)
Transfer rate: 105.60 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 2.2 0 15
Processing: 0 1 4.2 0 15
Waiting: 0 1 4.1 0 15
Total: 0 1 4.7 0 15

Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 15
95% 15
98% 15
99% 15
100% 15 (longest request)
This round was a clean sweep:
requests served per second: 463.77 / 533.33 (mod_python / mod_lua)
time per request in milliseconds: 2.156 / 1.875 (mod_python / mod_lua)

Why performance improved:

1 mod_lua switched from serial to parallel request handling
2 the Lua virtual machines are now kept in an idle pool (see the sketch below)
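The idle-pool idea can be sketched in C roughly as follows. This is not the actual mod_lua code, just the concept; it assumes the Lua 5.1 API and omits locking:

#include <lua.h>
#include <lauxlib.h>
#include <lualib.h>

#define POOL_MAX 16

static lua_State *pool[POOL_MAX];  /* idle interpreters */
static int pool_top = 0;

/* Take an idle VM from the pool, or create one on demand. */
static lua_State *vm_get(void)
{
    lua_State *L;
    if (pool_top > 0)
        return pool[--pool_top];
    L = luaL_newstate();
    luaL_openlibs(L);
    return L;
}

/* Return the VM instead of closing it, so the next request skips
 * the cost of luaL_newstate()/luaL_openlibs(). */
static void vm_put(lua_State *L)
{
    if (pool_top < POOL_MAX)
        pool[pool_top++] = L;
    else
        lua_close(L);
}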

Saturday, October 07, 2006

How to add a new algorithm to OpenSSL's public-key algorithms

赵治国 zhaozg(at)gmail(dot)com http://zhaozg.googlepages.com/

2002-07-06: finished as a Word document

2005-04-07: published on my personal Blogger

Of OpenSSL's public-key algorithms, I discuss only the RSA module!




The RSA-related data structures include the following.

typedef struct rsa_st RSA;


typedef struct rsa_meth_st
{
    const char *name;
    int (*rsa_pub_enc)(int flen, unsigned char *from, unsigned char *to,
                       RSA *rsa, int padding);
    int (*rsa_pub_dec)(int flen, unsigned char *from, unsigned char *to,
                       RSA *rsa, int padding);
    int (*rsa_priv_enc)(int flen, unsigned char *from, unsigned char *to,
                        RSA *rsa, int padding);
    int (*rsa_priv_dec)(int flen, unsigned char *from, unsigned char *to,
                        RSA *rsa, int padding);
    int (*rsa_mod_exp)(BIGNUM *r0, BIGNUM *I, RSA *rsa); /* Can be null */
    int (*bn_mod_exp)(BIGNUM *r, BIGNUM *a, const BIGNUM *p,
                      const BIGNUM *m, BN_CTX *ctx,
                      BN_MONT_CTX *m_ctx); /* Can be null */
    int (*init)(RSA *rsa);   /* called at new */
    int (*finish)(RSA *rsa); /* called at free */
    int flags;               /* RSA_METHOD_FLAG_* things */
    char *app_data;          /* may be needed! */
    /* New sign and verify functions: some libraries don't allow arbitrary data
     * to be signed/verified: this allows them to be used. Note: for this to work
     * the RSA_public_decrypt() and RSA_private_encrypt() should *NOT* be used
     * RSA_sign(), RSA_verify() should be used instead. Note: for backwards
     * compatibility this functionality is only enabled if the RSA_FLAG_SIGN_VER
     * option is set in 'flags'.
     */
    int (*rsa_sign)(int type, unsigned char *m, unsigned int m_len,
                    unsigned char *sigret, unsigned int *siglen, RSA *rsa);
    int (*rsa_verify)(int dtype, unsigned char *m, unsigned int m_len,
                      unsigned char *sigbuf, unsigned int siglen, RSA *rsa);
} RSA_METHOD;


/* Additions to the RSA algorithm usually provide an interface to an accelerator
 * or crypto card that the OpenSSL modules can use, wrapping the card's operations
 * in the structure above. Not all of the functions above have to be implemented,
 * but the four encrypt/decrypt functions and the init and finish functions must
 * be provided. */


struct rsa_st
{
    /* The first parameter is used to pickup errors where
     * this is passed instead of an EVP_PKEY, it is set to 0 */
    int pad;
    int version;
#if 0
    RSA_METHOD *meth;
#else
    struct engine_st *engine;
#endif
    BIGNUM *n;    /* n = pq */
    BIGNUM *e;    /* e and n form the public key; e is coprime to (p-1)(q-1) */
    BIGNUM *d;    /* the private key, ed = 1 mod (p-1)(q-1) */
    BIGNUM *p;    /* large prime p; p and q are coprime and of equal length */
    BIGNUM *q;    /* large prime q */
    BIGNUM *dmp1;
    BIGNUM *dmq1;
    BIGNUM *iqmp;
    /* be careful using this if the RSA structure is shared */
    CRYPTO_EX_DATA ex_data;
    int references;
    int flags;

    /* Used to cache montgomery values */
    BN_MONT_CTX *_method_mod_n;
    BN_MONT_CTX *_method_mod_p;
    BN_MONT_CTX *_method_mod_q;

    /* all BIGNUM values are actually in the following data, if it is not
     * NULL */
    char *bignum_data;
    BN_BLINDING *blinding;
};

/* Encryption: c = m^e mod n; decryption: m = c^d mod n.
 * A tiny worked example: with p = 3, q = 11 (so n = 33), e = 3 and d = 7
 * (3*7 = 21 = 1 mod 20), the message m = 4 encrypts to c = 4^3 mod 33 = 31,
 * and 31^7 mod 33 = 4 recovers it. */

/* Depending on what the algorithm needs, and on whether the public or the
 * private key is used, the fields of the structure above are filled in
 * selectively; n, e and flags must be initialized in every case. */

/* If you are writing an interface for a hardware crypto engine, the modules
 * crypt/engine/hw_*.c are good references. */

/* Big numbers (BIGNUM) are described in a separate document. */
/* Type needs to be a bit field
 * Sub-type needs to be for variations on the method, as in, can it do
 * arbitrary encryption.... */
typedef struct evp_pkey_st
	{
	int type;
	int save_type;
	int references;
	union {
		char *ptr;
#ifndef NO_RSA
		struct rsa_st *rsa;	/* RSA */
#endif
#ifndef NO_DSA
		struct dsa_st *dsa;	/* DSA */
#endif
#ifndef NO_DH
		struct dh_st *dh;	/* DH */
#endif
		} pkey;
	int save_parameters;
	STACK_OF(X509_ATTRIBUTE) *attributes; /* [ 0 ] */
	} EVP_PKEY;


#define EVP_PKEY_MO_SIGN	0x0001
#define EVP_PKEY_MO_VERIFY	0x0002
#define EVP_PKEY_MO_ENCRYPT	0x0004
#define EVP_PKEY_MO_DECRYPT	0x0008


This structure encapsulates the various public and private keys used by the public-key algorithms.


Essentially, the job of an OpenSSL hardware engine is to implement the function pointers in the RSA_METHOD structure.


/* How the software RSA implementation generates a public/private key pair */
RSA *RSA_generate_key(int bits, unsigned long e_value,
	void (*callback)(int,int,void *), void *cb_arg)
	{
	/*
	 * Inputs:
	 *   bits     - key length in bits
	 *   e_value  - the value of e, i.e. the public exponent
	 *   callback - progress callback
	 *   cb_arg   - argument handed to the callback
	 */
	RSA *rsa=NULL;
	BIGNUM *r0=NULL,*r1=NULL,*r2=NULL,*r3=NULL,*tmp;
	int bitsp,bitsq,ok= -1,n=0,i;
	BN_CTX *ctx=NULL,*ctx2=NULL;

	ctx=BN_CTX_new();
	if (ctx == NULL) goto err;
	ctx2=BN_CTX_new();
	if (ctx2 == NULL) goto err;
	BN_CTX_start(ctx);
	r0 = BN_CTX_get(ctx);
	r1 = BN_CTX_get(ctx);
	r2 = BN_CTX_get(ctx);
	r3 = BN_CTX_get(ctx);
	if (r3 == NULL) goto err;

	/* Bit lengths of the large primes p and q:
	 * for a key of strength 'bits',
	 *   p gets bitsp = (bits+1)/2 bits,
	 *   q gets bitsq = bits-bitsp bits. */
	bitsp=(bits+1)/2;
	bitsq=bits-bitsp;
	rsa=RSA_new();
	if (rsa == NULL) goto err;

	/* set e */
	rsa->e=BN_new();
	if (rsa->e == NULL) goto err;

#if 1
	/* The problem is when building with 8, 16, or 32 BN_ULONG,
	 * unsigned long can be larger */
	for (i=0; i<(int)sizeof(unsigned long)*8; i++)
		{
		if (e_value & (1UL<<i))
			BN_set_bit(rsa->e,i);
		}
#else
	if (!BN_set_word(rsa->e,e_value)) goto err;
#endif

	/* generate p and q */
	for (;;)
		{
		rsa->p=BN_generate_prime(NULL,bitsp,0,NULL,NULL,callback,cb_arg);
		if (rsa->p == NULL) goto err;
		if (!BN_sub(r2,rsa->p,BN_value_one())) goto err;
		if (!BN_gcd(r1,r2,rsa->e,ctx)) goto err;
		if (BN_is_one(r1)) break;
		if (callback != NULL) callback(2,n++,cb_arg);
		BN_free(rsa->p);
		}
	if (callback != NULL) callback(3,0,cb_arg);
	for (;;)
		{
		rsa->q=BN_generate_prime(NULL,bitsq,0,NULL,NULL,callback,cb_arg);
		if (rsa->q == NULL) goto err;
		if (!BN_sub(r2,rsa->q,BN_value_one())) goto err;
		if (!BN_gcd(r1,r2,rsa->e,ctx)) goto err;
		if (BN_is_one(r1) && (BN_cmp(rsa->p,rsa->q) != 0))
			break;
		if (callback != NULL) callback(2,n++,cb_arg);
		BN_free(rsa->q);
		}
	if (callback != NULL) callback(3,1,cb_arg);
	if (BN_cmp(rsa->p,rsa->q) < 0)
		{
		tmp=rsa->p;
		rsa->p=rsa->q;
		rsa->q=tmp;
		}

	/* calculate n */
	rsa->n=BN_new();
	if (rsa->n == NULL) goto err;
	if (!BN_mul(rsa->n,rsa->p,rsa->q,ctx)) goto err;

	/* calculate d */
	if (!BN_sub(r1,rsa->p,BN_value_one())) goto err;	/* p-1 */
	if (!BN_sub(r2,rsa->q,BN_value_one())) goto err;	/* q-1 */
	if (!BN_mul(r0,r1,r2,ctx)) goto err;	/* (p-1)(q-1) */

	/* should not be needed, since gcd(p-1,e) == 1 and gcd(q-1,e) == 1 */
	/* for (;;)
		{
		if (!BN_gcd(r3,r0,rsa->e,ctx)) goto err;
		if (BN_is_one(r3)) break;

		if (1)
			{
			if (!BN_add_word(rsa->e,2L)) goto err;
			continue;
			}
		RSAerr(RSA_F_RSA_GENERATE_KEY,RSA_R_BAD_E_VALUE);
		goto err;
		}
	*/

	/* compute d, the private exponent */
	rsa->d=BN_mod_inverse(NULL,rsa->e,r0,ctx2);	/* d */
	if (rsa->d == NULL) goto err;

	/* compute the intermediate (CRT) values */
	/* calculate d mod (p-1) */
	rsa->dmp1=BN_new();
	if (rsa->dmp1 == NULL) goto err;
	if (!BN_mod(rsa->dmp1,rsa->d,r1,ctx)) goto err;

	/* calculate d mod (q-1) */
	rsa->dmq1=BN_new();
	if (rsa->dmq1 == NULL) goto err;
	if (!BN_mod(rsa->dmq1,rsa->d,r2,ctx)) goto err;

	/* calculate inverse of q mod p */
	rsa->iqmp=BN_mod_inverse(NULL,rsa->q,rsa->p,ctx2);
	if (rsa->iqmp == NULL) goto err;

	ok=1;
err:
	if (ok == -1)
		{
		RSAerr(RSA_F_RSA_GENERATE_KEY,ERR_LIB_BN);
		ok=0;
		}
	BN_CTX_end(ctx);
	BN_CTX_free(ctx);
	BN_CTX_free(ctx2);

	if (!ok)
		{
		if (rsa != NULL) RSA_free(rsa);
		return(NULL);
		}
	else
		return(rsa);
	}
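
A short usage sketch of this generator: RSA_F4 is OpenSSL's name for the common public exponent 65537, and EVP_PKEY_assign_RSA wraps the result in the EVP_PKEY container described earlier.

/* generate a 1024-bit key pair with e = 65537 */
RSA *rsa = RSA_generate_key(1024, RSA_F4, NULL, NULL);
if (rsa != NULL)
	{
	EVP_PKEY *pkey = EVP_PKEY_new();
	if (pkey != NULL && EVP_PKEY_assign_RSA(pkey, rsa))
		{
		/* ... use pkey ... */
		EVP_PKEY_free(pkey);	/* frees the wrapped RSA key too */
		}
	}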



/* This is a structure for storing implementations of various crypto
 * algorithms and functions. */
typedef struct engine_st
	{
	const char *id;
	const char *name;
	RSA_METHOD *rsa_meth;
	DSA_METHOD *dsa_meth;
	DH_METHOD *dh_meth;
	RAND_METHOD *rand_meth;
	BN_MOD_EXP bn_mod_exp;
	BN_MOD_EXP_CRT bn_mod_exp_crt;
	int (*init)(void);
	int (*finish)(void);
	int (*ctrl)(int cmd, long i, void *p, void (*f)());
	EVP_PKEY *(*load_privkey)(const char *key_id, const char *passphrase);
	EVP_PKEY *(*load_pubkey)(const char *key_id, const char *passphrase);
	int flags;
	/* reference count on the structure itself */
	int struct_ref;
	/* reference count on usability of the engine type. NB: This
	 * controls the loading and initialisation of any functionality
	 * required by this engine, whereas the previous count is
	 * simply to cope with (de)allocation of this structure. Hence,
	 * running_ref <= struct_ref at all times. */
	int funct_ref;
	/* Used to maintain the linked-list of engines. */
	struct engine_st *prev;
	struct engine_st *next;
	} ENGINE;


/* This structure describes a hardware engine. The fields that must be implemented are id, name, init and finish; if the engine provides RSA, rsa_meth must be implemented as well. */


/* Provide a function that returns the ENGINE, and declare it in crypt/engine/engine_int.h. */

ENGINE *ENGINE_HardCard();

To have HardCard appear in OpenSSL's default engine list, add the registration code to engine_internal_check() in crypto/engine/engine_list.c:

static int engine_internal_check(void)
	{
	if(engine_list_flag)
		return 1;
	/* This is our first time up, we need to populate the list
	 * with our statically compiled-in engines. */
	if(!engine_list_add(ENGINE_openssl()))
		return 0;
#ifndef NO_HW
#ifndef NO_HW_CSWIFT
	if(!engine_list_add(ENGINE_cswift()))
		return 0;
#endif /* !NO_HW_CSWIFT */
#ifndef NO_HW_NCIPHER
	if(!engine_list_add(ENGINE_ncipher()))
		return 0;
#endif /* !NO_HW_NCIPHER */
#ifndef NO_HW_ATALLA
	if(!engine_list_add(ENGINE_atalla()))
		return 0;
#endif /* !NO_HW_ATALLA */
#endif /* !NO_HW */
	engine_list_flag = 1;
	return 1;
	}
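
For instance, adding our hypothetical HardCard engine would mean one more block, modeled directly on the entries above (the NO_HW_HARDCARD guard is an assumption):

#ifndef NO_HW_HARDCARD
	if(!engine_list_add(ENGINE_HardCard()))
		return 0;
#endif /* !NO_HW_HARDCARD */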

Otherwise, you have to perform the engine_list_add step yourself before using the engine.



The engine's initialization is done in its init function, which gets called from ENGINE_set_default(ENGINE *e, long flags); finish releases the resources the engine holds.


To use the algorithms an engine implements, code along these lines can be used (remember to release the reference with ENGINE_free(e) when done):

ENGINE *e=NULL;

if((e = ENGINE_by_id("pkicard")) == NULL)
	{
	printf("invalid engine pkicard");
	}
printf("engine pkicard set\n");
if(!ENGINE_set_default(e, ENGINE_METHOD_RSA))
	{
	/* make the engine's implementation the default RSA method */
	printf("failed to set pkicard RSA as default\n");
	goto end;
	}


Implementing public- and private-key encryption/decryption involves padding, and the padding must be handled exactly as in the standard RSA implementation.


RSA_pub_dec() uses RSA_padding_check_PKCS1_type_1 to check and strip the padding.
RSA_pub_enc() uses RSA_padding_add_PKCS1_type_2 to add the padding.
RSA_priv_dec() uses RSA_padding_check_PKCS1_type_2 to check and strip the padding.
RSA_priv_enc() uses RSA_padding_add_PKCS1_type_1 to add the padding.

This is the PKCS#1 padding scheme; for the details, see the source:

crypt/rsa/rsa_pk1.c
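
To make the padding step concrete, here is a sketch of how a custom rsa_priv_enc (expanding the hc_priv_enc stub of the earlier HardCard example) could apply PKCS#1 type 1 padding before handing the block to the card; hc_card_priv_modexp stands in for the card primitive and is hypothetical:

static int hc_priv_enc(int flen, unsigned char *from, unsigned char *to,
		       RSA *rsa, int padding)
	{
	int num = BN_num_bytes(rsa->n);
	unsigned char *buf;
	int ret = -1;

	if (padding != RSA_PKCS1_PADDING) return -1;
	if ((buf = OPENSSL_malloc(num)) == NULL) return -1;
	if (RSA_padding_add_PKCS1_type_1(buf, num, from, flen))
		{
		/* hand the padded block to the (hypothetical) card primitive */
		if (hc_card_priv_modexp(to, buf, num, rsa) > 0)
			ret = num;
		}
	OPENSSL_free(buf);
	return ret;
	}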


Other padding modes can be found in crypt/rsa/rsa.h; see the relevant code for how they are used.


When implementing the interface, note that byte order and bit order may differ from card to card, so the necessary conversions must be made to satisfy OpenSSL's interface.


Also, the engine directory already contains engine code for quite a few crypto cards, which is worth consulting when writing your own.


How to Add a New Algorithm to OpenSSL's Symmetric Ciphers

赵治国 zhaozg(at)gmail(dot)com http://zhaozg.googlepages.com/

2002-06-18: finished as a Word document

2005-04-07: published on my personal Blogger



1. Loading the cipher algorithms



Before any cipher can be used, OpenSSL_add_all_algorithms must be called to load the cipher and one-way hash (digest) functions:




void OpenSSL_add_all_algorithms(void)
	{
	OpenSSL_add_all_ciphers();	/* load the ciphers */
	OpenSSL_add_all_digests();	/* load the one-way hash functions */
	}


OpenSSL_add_all_ciphers(void) is implemented as follows:


void OpenSSL_add_all_ciphers(void)
	{
	EVP_add_cipher(EVP_rc2_cfb());
	/* ... */
	PKCS12_PBE_add();
	PKCS5_PBE_add();
	}

/* The main job here is to register the ciphers with the global variable
 * static LHASH *names_lh; any newly added cipher must be registered with
 * names_lh as well. */


Since the DES interface is rather large, we start our study with the IDEA interface.

#ifndef NO_IDEA
	EVP_add_cipher(EVP_idea_ecb());	/* add the ECB mode */
	EVP_add_cipher(EVP_idea_cfb());	/* add the CFB mode */
	EVP_add_cipher(EVP_idea_ofb());	/* add the OFB mode */
	EVP_add_cipher(EVP_idea_cbc());	/* add the CBC mode */
	EVP_add_cipher_alias(SN_idea_cbc,"IDEA");	/* alias "IDEA" for the CBC cipher */
	EVP_add_cipher_alias(SN_idea_cbc,"idea");	/* alias "idea" for the CBC cipher */
#endif
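
Registering a brand-new cipher follows the same pattern. A sketch for a hypothetical cipher "foo", where the NO_FOO guard, the EVP_foo_* constructors and SN_foo_cbc simply follow the naming convention above:

#ifndef NO_FOO
	EVP_add_cipher(EVP_foo_ecb());
	EVP_add_cipher(EVP_foo_cfb());
	EVP_add_cipher(EVP_foo_ofb());
	EVP_add_cipher(EVP_foo_cbc());
	EVP_add_cipher_alias(SN_foo_cbc,"FOO");
	EVP_add_cipher_alias(SN_foo_cbc,"foo");
#endif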


When IDEA is compiled in, OpenSSL will pick up the IDEA cipher module.


Now let us see how EVP_add_cipher is implemented:

int EVP_add_cipher(EVP_CIPHER *c)
	{
	int r;

	r=OBJ_NAME_add(OBJ_nid2sn(c->nid),OBJ_NAME_TYPE_CIPHER_METH,(char *)c);
	if (r == 0) return(0);
	r=OBJ_NAME_add(OBJ_nid2ln(c->nid),OBJ_NAME_TYPE_CIPHER_METH,(char *)c);
	return(r);
	}

/* registers an obj_name_types entry with the global variable names_lh */

int OBJ_NAME_add(const char *name, int type, const char *data)
	{
	OBJ_NAME *onp,*ret;
	int alias;

	if ((names_lh == NULL) && !OBJ_NAME_init()) return(0);

	alias=type&OBJ_NAME_ALIAS;
	type&= ~OBJ_NAME_ALIAS;

	onp=(OBJ_NAME *)OPENSSL_malloc(sizeof(OBJ_NAME));
	if (onp == NULL)
		{
		/* ERROR */
		return(0);
		}

	onp->name=name;
	onp->alias=alias;
	onp->type=type;
	onp->data=data;

	ret=(OBJ_NAME *)lh_insert(names_lh,onp);
	if (ret != NULL)
		{
		/* free things */
		if ((name_funcs_stack != NULL) && (sk_NAME_FUNCS_num(name_funcs_stack) > ret->type))
			{
			/* XXX: I'm not sure I understand why the free
			 * function should get three arguments...
			 * -- Richard Levitte */
			sk_NAME_FUNCS_value(name_funcs_stack,ret->type)
				->free_func(ret->name,ret->type,ret->data);
			}
		OPENSSL_free(ret);
		}
	else
		{
		if (lh_error(names_lh))
			{
			/* ERROR */
			return(0);
			}
		}
	return(1);
	}


names_lh is a global LHASH variable that maintains entries of the obj_name_types kind (defined in crypt/objects/o_names.c).
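
Once a cipher has been registered with names_lh, it can be retrieved by any of its registered names, for example:

const EVP_CIPHER *c = EVP_get_cipherbyname("IDEA");	/* resolved via names_lh */
if (c == NULL)
	{
	/* not compiled in, or OpenSSL_add_all_ciphers() was never called */
	}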


Related globals (in crypt/objects/obj_dat.h):

static unsigned char lvalues[2896]: a pre-initialized global holding the data of every OpenSSL object.
nid_objs: a pre-initialized global array of ASN1_OBJECT recording the names of all the types OpenSSL uses.
static ASN1_OBJECT *sn_objs[NUM_SN]: global, pre-initialized.
static ASN1_OBJECT *ln_objs[NUM_LN]: global, pre-initialized.


The structure defined in crypt/object/objects.h:

typedef struct obj_name_st
	{
	int type;
	int alias;
	const char *name;
	const char *data;
	} OBJ_NAME;


Note: the crypto/objects directory holds the programs that keep all of OpenSSL modular; they are introduced one by one below.

objects.txt defines SN_base, LN_base, NID_base and OBJ_base following a fixed syntax. Processing it with the perl script objects.pl, via the command perl objects.pl objects.txt obj_mac.num obj_mac.h, generates the two files obj_mac.num and obj_mac.h.

obj_mac.num records the correspondence between OBJ_base and NID_base.

obj_mac.h provides the C definitions of SN_base, LN_base, NID_base and OBJ_base.

objects.h provides the same C definitions of SN_base, LN_base, NID_base and OBJ_base; whenever obj_mac.h is regenerated, objects.h must be updated to stay consistent with it. objects.h also declares a number of functions that operate on OBJ_NAME.

objects.h, processed with perl obj_dat.pl objects.h obj_dat.h, yields the header file obj_dat.h.

obj_dat.h defines the following global variables:

#define NUM_NID 393
#define NUM_SN 392
#define NUM_LN 392
#define NUM_OBJ 366
static unsigned char lvalues[2896],
static ASN1_OBJECT nid_objs[NUM_NID],
static ASN1_OBJECT *sn_objs[NUM_SN],
static ASN1_OBJECT *ln_objs[NUM_LN],
static ASN1_OBJECT *obj_objs[NUM_OBJ],

These are all tied to ASN1_OBJECT, and every object defined in objects.txt shows up in them. Their exact role is still not clear to me and needs further study.


A study of the macros in OpenSSL

The definition of the cipher algorithm interface:

typedef struct evp_cipher_st EVP_CIPHER;

/* Once added, ciphers are managed by names_lh and can be looked up by
 * name or alias. */
struct evp_cipher_st
	{
	int nid;		/* the cipher's NID */
	int block_size;		/* block size */
	int key_len;		/* Default value for variable length ciphers */
	int iv_len;		/* initialization vector length for the CBC, CFB and OFB modes */
	unsigned long flags;	/* Various flags */
	int (*init)(EVP_CIPHER_CTX *ctx, const unsigned char *key,
		    const unsigned char *iv, int enc);	/* init key */
	int (*do_cipher)(EVP_CIPHER_CTX *ctx, unsigned char *out,
			 const unsigned char *in, unsigned int inl);	/* encrypt/decrypt data */
	int (*cleanup)(EVP_CIPHER_CTX *);	/* cleanup ctx */
	int ctx_size;		/* how big the ctx needs to be */
	int (*set_asn1_parameters)(EVP_CIPHER_CTX *, ASN1_TYPE *);	/* Populate a ASN1_TYPE with parameters */
	int (*get_asn1_parameters)(EVP_CIPHER_CTX *, ASN1_TYPE *);	/* Get parameters from a ASN1_TYPE */
	int (*ctrl)(EVP_CIPHER_CTX *, int type, int arg, void *ptr);	/* Miscellaneous operations */
	void *app_data;		/* Application data */
	};

Once an EVP_CIPHER variable is defined correctly, the algorithm can be accepted by OpenSSL.
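
Before looking at the macro below, here is a hand-written sketch of one such definition for a hypothetical cipher "foo" in CBC mode; NID_foo_cbc and the foo_* functions are assumed names:

static EVP_CIPHER foo_cbc =
	{
	NID_foo_cbc,		/* hypothetical NID */
	8,			/* block size */
	16,			/* key length */
	8,			/* iv length */
	EVP_CIPH_CBC_MODE,	/* flags */
	foo_init_key,		/* init */
	foo_cbc_cipher,		/* do_cipher */
	foo_cleanup,		/* cleanup */
	sizeof(EVP_CIPHER_CTX),	/* ctx_size */
	NULL,			/* set_asn1_parameters */
	NULL,			/* get_asn1_parameters */
	NULL,			/* ctrl */
	NULL			/* app_data */
	};

EVP_CIPHER *EVP_foo_cbc(void) { return &foo_cbc; }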


The following macro produces the EVP_CIPHER definitions for the ECB, CBC, CFB and OFB modes of a cipher.

#define BLOCK_CIPHER_defs(cname, kstruct, \
		nid, block_size, key_len, iv_len, flags, \
		init_key, cleanup, set_asn1, get_asn1, ctrl) \
static EVP_CIPHER cname##_cbc = {\
	nid##_cbc, block_size, key_len, iv_len, \
	flags | EVP_CIPH_CBC_MODE,\
	init_key,\
	cname##_cbc_cipher,\
	cleanup,\
	sizeof(EVP_CIPHER_CTX)-sizeof((((EVP_CIPHER_CTX *)NULL)->c))+\
		sizeof((((EVP_CIPHER_CTX *)NULL)->c.kstruct)),\
	set_asn1, get_asn1,\
	ctrl, \
	NULL \
	};\
EVP_CIPHER *EVP_##cname##_cbc(void) { return &cname##_cbc; }\
static EVP_CIPHER cname##_cfb = {\
	nid##_cfb64, 1, key_len, iv_len, \
	flags | EVP_CIPH_CFB_MODE,\
	init_key,\
	cname##_cfb_cipher,\
	cleanup,\
	sizeof(EVP_CIPHER_CTX)-sizeof((((EVP_CIPHER_CTX *)NULL)->c))+\
		sizeof((((EVP_CIPHER_CTX *)NULL)->c.kstruct)),\
	set_asn1, get_asn1,\
	ctrl,\
	NULL \
	};\
EVP_CIPHER *EVP_##cname##_cfb(void) { return &cname##_cfb; }\
static EVP_CIPHER cname##_ofb = {\
	nid##_ofb64, 1, key_len, iv_len, \
	flags | EVP_CIPH_OFB_MODE,\
	init_key,\
	cname##_ofb_cipher,\
	cleanup,\
	sizeof(EVP_CIPHER_CTX)-sizeof((((EVP_CIPHER_CTX *)NULL)->c))+\
		sizeof((((EVP_CIPHER_CTX *)NULL)->c.kstruct)),\
	set_asn1, get_asn1,\
	ctrl,\
	NULL \
	};\
EVP_CIPHER *EVP_##cname##_ofb(void) { return &cname##_ofb; }\
static EVP_CIPHER cname##_ecb = {\
	nid##_ecb, block_size, key_len, iv_len, \
	flags | EVP_CIPH_ECB_MODE,\
	init_key,\
	cname##_ecb_cipher,\
	cleanup,\
	sizeof(EVP_CIPHER_CTX)-sizeof((((EVP_CIPHER_CTX *)NULL)->c))+\
		sizeof((((EVP_CIPHER_CTX *)NULL)->c.kstruct)),\
	/* ctx_size: EVP_CIPHER_CTX holds a union of key schedules; this
	 * sizeof arithmetic yields the context length actually needed */ \
	set_asn1, get_asn1,\
	ctrl,\
	NULL \
	};\
EVP_CIPHER *EVP_##cname##_ecb(void) { return &cname##_ecb; }

After preprocessing, this macro turns into the EVP_CIPHER definitions for the four cipher modes. The structure packages the cipher function, the key initialization function and the key cleanup function. Besides implementing the cipher itself, the matching key structure must also be implemented.
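
For example, a single invocation along these lines would emit all four mode definitions for the hypothetical cipher "foo", assuming a foo_ks member exists in the EVP_CIPHER_CTX union (see below) and the foo_* functions are implemented:

BLOCK_CIPHER_defs(foo, foo_ks, NID_foo, 8, 16, 8, 0,
		foo_init_key, foo_cleanup,
		EVP_CIPHER_set_asn1_iv, EVP_CIPHER_get_asn1_iv, NULL)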


EVP_CIPHER_CTX is that key structure; it manages the keys for the cipher.

typedef struct evp_cipher_ctx_st EVP_CIPHER_CTX;
struct evp_cipher_ctx_st
	{
	const EVP_CIPHER *cipher;
	int encrypt;		/* encrypt or decrypt */
	int buf_len;		/* number we have left */

	unsigned char oiv[EVP_MAX_IV_LENGTH];	/* original iv */
	unsigned char iv[EVP_MAX_IV_LENGTH];	/* working iv */
	unsigned char buf[EVP_MAX_IV_LENGTH];	/* saved partial block */
	int num;		/* used by cfb/ofb mode */

	void *app_data;		/* application stuff */
	int key_len;		/* May change for variable length cipher */
	/* a union manages the keys, giving flexible handling of every
	 * cipher's key schedule */
	union {
#ifndef NO_RC4
		struct
			{
			unsigned char key[EVP_RC4_KEY_SIZE];
			RC4_KEY ks;	/* working key */
			} rc4;
#endif
#ifndef NO_DES
		des_key_schedule des_ks;	/* key schedule */
		struct
			{
			des_key_schedule ks;	/* key schedule */
			des_cblock inw;
			des_cblock outw;
			} desx_cbc;
		struct
			{
			des_key_schedule ks1;	/* key schedule */
			des_key_schedule ks2;	/* key schedule (for ede) */
			des_key_schedule ks3;	/* key schedule (for ede3) */
			} des_ede;
#endif
#ifndef NO_IDEA
		IDEA_KEY_SCHEDULE idea_ks;	/* key schedule */
#endif
#ifndef NO_RC2
		struct {
			int key_bits;	/* effective key bits */
			RC2_KEY ks;	/* key schedule */
			} rc2;
#endif
#ifndef NO_RC5
		struct {
			int rounds;	/* number of rounds */
			RC5_32_KEY ks;	/* key schedule */
			} rc5;
#endif
#ifndef NO_BF
		BF_KEY bf_ks;	/* key schedule */
#endif
#ifndef NO_CAST
		CAST_KEY cast_ks;	/* key schedule */
#endif
		} c;
	};
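
A new cipher therefore also needs its key schedule added to this union. A sketch, with NO_FOO and FOO_KEY as assumed names:

#ifndef NO_FOO
		FOO_KEY foo_ks;	/* key schedule of the new cipher */
#endif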


The following functions set the encryption and the decryption key:

void set_encrypt_key(const unsigned char *key, KEY_SCHEDULE *ks)

void set_decrypt_key(const unsigned char *key, KEY_SCHEDULE *ks)

On top of these two functions, the key initialization function for EVP_CIPHER is implemented:

static int init_key(EVP_CIPHER_CTX *ctx, const unsigned char *key,
		    const unsigned char *iv, int enc)
	{
	if(!enc)
		{
		/* CFB and OFB only ever use the block cipher in the
		 * encrypt direction */
		if (EVP_CIPHER_CTX_mode(ctx) == EVP_CIPH_OFB_MODE) enc = 1;
		else if (EVP_CIPHER_CTX_mode(ctx) == EVP_CIPH_CFB_MODE) enc = 1;
		}
	if (enc)
		set_encrypt_key(key,&(ctx->c.ks));
	else
		set_decrypt_key(key,&(ctx->c.ks));
	return 1;
	}


/* wipe the key material kept in memory */
static int clean_key(EVP_CIPHER_CTX *ctx)
	{
	if(ctx)
		memset(&(ctx->c.ks),0,sizeof(ctx->c.ks));
	return 1;
	}


If the EVP_CIPHER structures are defined through the BLOCK_CIPHER_defs macro, the four mode functions must have exactly the interfaces the expanded macro expects:

int cname_ecb_cipher(EVP_CIPHER_CTX *ctx, unsigned char *out, const unsigned char *in, unsigned int inl);

int cname_cbc_cipher(EVP_CIPHER_CTX *ctx, unsigned char *out, const unsigned char *in, unsigned int inl);

int cname_cfb_cipher(EVP_CIPHER_CTX *ctx, unsigned char *out, const unsigned char *in, unsigned int inl);

int cname_ofb_cipher(EVP_CIPHER_CTX *ctx, unsigned char *out, const unsigned char *in, unsigned int inl);
2. Implementing the four cipher modes

How the four cipher modes relate to the basic block encrypt/decrypt primitives.

Assume the basic block encryption and decryption functions have the following interfaces:

void encrypt(const unsigned char *in, unsigned char *out, const KEY *key, int *length)

void decrypt(const unsigned char *in, unsigned char *out, const KEY *key, int *length)


void ecb_encrypt(const unsigned char *in, unsigned char *out,
		 long length, const KEY *key, unsigned char *iv, int enc)
	{/* Electronic Code Book mode */
	register int i;
	int len = 8;
	register long l = length;

	if(enc)	/* encryption */
		{
		for(i=0; i<l; i+=8)
			{
			encrypt(&in[i], &out[i], key, &len);	/* len == 8 will always be true here */
			}
		}
	else
		{
		for(i=0; i<l; i+=8)
			{
			decrypt(&in[i], &out[i], key, &len);	/* len == 8 will always be true here */
			}
		}
	}


void cbc_encrypt(const unsigned char *in, unsigned char *out,
		 long length, const KEY *key, unsigned char *iv, int enc)
	{/* Cipher Block Chaining mode */
	register int i;
	int len = 8;
	register long l = length;
	unsigned char buf[8];

	if(enc)	/* encryption */
		{
		for(l-=8; l>=0; l-=8, in+=8, out+=8)
			{
			for(i=0; i<8; i++)
				buf[i] = in[i] ^ iv[i];
			encrypt(buf, iv, key, &len);	/* len == 8 will always be true here */
			for(i=0; i<8; i++)
				out[i] = iv[i];
			}
		/* final (partial) block */
		if(l != -8)
			{
			for(i=0; i<l+8; i++)
				buf[i] = in[i] ^ iv[i];
			for(; i<8; i++)
				buf[i] = iv[i];
			encrypt(buf, iv, key, &len);	/* len == 8 here */
			for(i=0; i<8; i++)
				out[i] = iv[i];
			}
		/* the cipher output becomes the next iv; in XOR iv is the
		 * cipher input */
		}
	else	/* decryption */
		{
		for(l-=8; l>=0; l-=8, in+=8, out+=8)
			{
			decrypt(in, buf, key, &len);
			for(i=0; i<8; i++)
				out[i] = buf[i] ^ iv[i];
			for(i=0; i<8; i++)
				iv[i] = in[i];
			}
		/* final block */
		if(l != -8)
			{
			decrypt(in, buf, key, &len);
			for(i=0; i<l+8; i++)
				out[i] = buf[i] ^ iv[i];
			for(i=0; i<8; i++)
				iv[i] = in[i];
			}
		}
	l = 0;
	i = 0;
	}


void cfb64_encrypt(const unsigned char *in, unsigned char *out,
		   long length, const KEY *key, unsigned char *iv, int *num, int enc)
	{/* Cipher FeedBack mode */
	register long l = length;
	unsigned char buf[8];
	register int i, save = 0, n = *num;	/* start from previously saved processing position */
	int len = 8;

	/* restore from previously saved iv */
	for(i=n; i<8; i++)
		buf[i] = iv[i];

	if(enc)
		{
		while(l--)
			{
			if(n == 0)
				{
				encrypt(iv, buf, key, &len);
				save = 1;
				}
			*(out++) = iv[n] = *(in++) ^ buf[n];
			n = (n+1)&0x07;
			}
		}
	else
		{
		while(l--)
			{
			if(n == 0)
				{
				encrypt(iv, buf, key, &len);
				save = 1;
				}
			*(out++) = (iv[n]=*(in++)) ^ buf[n];
			n = (n+1)&0x07;
			}
		}
	if(save)	/* store encrypted data into iv for next encryption */
		for(i=n; i<8; i++)
			iv[i] = buf[i];
	/* in CFB the cipher output becomes the next IV; in XORed with the
	 * encrypted IV yields the CFB output */
	*num = n;	/* store current processing position as entry of next encryption */
	save = i = n = 0;
	}


void ofb64_encrypt(const unsigned char *in, unsigned char *out,
		   long length, const KEY *key, unsigned char *iv, int *num)
	{/* Output FeedBack mode */
	register long l = length;
	register int i, n = *num;	/* start from previously saved processing position */
	int len = 8;
	unsigned char buf[8];

	/* restore from previously saved iv */
	if(n != 0)
		for(i=n; i<8; i++)
			buf[i] = iv[i];

	while(l--)
		{
		if(n == 0)
			{
			encrypt(iv, buf, key, &len);
			for(i=0; i<8; i++)
				iv[i] = buf[i];
			}
		*(out++) = *(in++) ^ buf[n];
		n = (n+1)&0x07;	/* n = (n+1) % 0x08 */
		/* the encrypted iv becomes the next iv; iv XOR in is the
		 * OFB output */
		}
	*num = n;	/* store current processing position as entry of next encryption */
	i = n = 0;
	}

3. Adding a new cipher to the SSL protocol

First, look at the relevant global variable ssl3_ciphers[], defined in ssl/s3_lib.c. Its element type is defined as follows:

typedef struct ssl_cipher_st
	{
	int valid;
	const char *name;		/* text name */
	unsigned long id;		/* id, 4 bytes, first is version */
	unsigned long algorithms;	/* what ciphers are used */
	unsigned long algo_strength;	/* strength and export flags */
	unsigned long algorithm2;	/* Extra flags */
	int strength_bits;		/* Number of bits really used */
	int alg_bits;			/* Number of bits for algorithm */
	unsigned long mask;		/* used for matching */
	unsigned long mask_strength;	/* also used for matching */
	} SSL_CIPHER;


An example entry:

/* Cipher 03 */
	{
	1,
	SSL3_TXT_RSA_RC4_40_MD5,	/* string, defined in ssl3.h */
	SSL3_CK_RSA_RC4_40_MD5,		/* integer, defined in ssl3.h */
	SSL_kRSA|SSL_aRSA|SSL_RC4|SSL_MD5|SSL_SSLV3,	/* defined in ssl_locl.h:
		key exchange | authentication | cipher | digest | protocol version */
	SSL_EXPORT|SSL_EXP40,
	0,
	40,
	128,
	SSL_ALL_CIPHERS,
	SSL_ALL_STRENGTHS,
	},


If you add a new cipher, check that every value you define is actually available; where bitwise operations are involved, the corresponding masks must be updated as well.


EVP_CIPHER *ssl_cipher_methods is defined in ssl_ciph.c and initialized in void load_ciphers(void); add the new cipher there. Then add the corresponding handling in int ssl_cipher_get_evp(SSL_SESSION *s, const EVP_CIPHER **enc, const EVP_MD **md, SSL_COMP **comp) and in static unsigned long ssl_cipher_get_disabled(void).
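
ssl_cipher_methods is just a table of EVP_CIPHER pointers indexed by cipher. As an illustration only (not actual OpenSSL code), the addition for the hypothetical cipher could look like this, where SSL_ENC_FOO_IDX, SSL_FOO and SN_foo_cbc are assumed names:

static void load_ciphers(void)
	{
	/* ... existing entries ... */
	ssl_cipher_methods[SSL_ENC_FOO_IDX] =
		EVP_get_cipherbyname(SN_foo_cbc);
	}

/* ssl_cipher_get_evp() must then map the SSL_FOO bit of the algorithms
 * mask to ssl_cipher_methods[SSL_ENC_FOO_IDX], and ssl_cipher_get_disabled()
 * must set the SSL_FOO bit whenever the lookup above returned NULL. */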