Top 10 System Administrator Truths
Vo0k writes "What are your top ten system administrator truths? We all know them already, but it's still fun re-telling them. Stuff like "90% of all hardware-related problems come from loose connectors", even though you already know it's true, may save you from replacing the "faulty" motherboard if you recall it at the right time."
All I got to say is... (Score:5, Insightful)
Re:In no particular order.... (Score:2, Insightful)
Rebooting Causes 90% of Unix problems.
Re:In no particular order.... (Score:1, Insightful)
This one is funny. Did you read the BOFH about the manager and Google?
Knowledge and experience are your best friends. I would sooner trust a coworker than Google.
Re:In no particular order.... (Score:2, Insightful)
If a user can cause a problem, then the program is buggy.
Power cables don't really "go" bad. (Score:5, Insightful)
We have the three rules. (Score:5, Insightful)
Rule 2. They don't know they are lying.
Rule 3. Sometimes they are telling the truth. Yes, sometimes what you think is impossible really is happening, or looks like it is happening.
My 2p (Score:5, Insightful)
Not too bad (Score:4, Insightful)
Is this really the case? We had several JetDirect-enabled printers at my former place of work and almost none of them had a card failure. We even had a few extra cards just in case. Several of the printers were actually quite old. The only big problem we had was with the HP-5P (I think that is the number). Some users' departments did not have the money to replace those crappy old printers. On a bit of an aside, we had several JetDirect "boxes" (the external box that connects the printer port to Ethernet) that were working great. I believe almost everyone on the IT staff had one at home for their printers.
No One Ever Got Fired For Buying Microsoft.
Not really true. There are some shops so enamored with Novell (mostly because of bosses stuck in the Stone Age) that the idea of purchasing Exchange or using a full-out Active Directory system with a Windows-only network storage share was unheard of. I once again reference my previous job.
Not too bad of a list overall. Most of the items are right, and it is quite true. To be honest, at the places I have worked there were really only a handful of problem employees, and most of them got handled directly by our SysAdmin or the head of IT because no one wanted to worry about what lie they might come up with about the work we were doing.
Re:Simple (Score:4, Insightful)
simple rules (Score:2, Insightful)
Re:95% of all problems.... (Score:5, Insightful)
Doing some GUI consultant work and writing a few user manuals for some pretty complex software has taught me one thing: Most user error is the fault of crappy software. A good setup (hardware or software) should be easy to use given the users.
Now, obviously it is all about knowing the audience. If you are writing an application for use by other software engineers versus people living in an assisted living home, well, that makes a difference, and you certainly can't cater to all people (for example the guy who writes code for a living but can't set up his own email at home).
The bottom line is, as much as it displeases us, not everyone is a geek. Not everyone cares about the latest firmware for their router, the latest patch for Call of Duty 2, or how to make a projection TV from an old overhead projector and a laptop from eBay. Our job, as geeks, is not to show everyone why they SHOULD care, but rather to make it easy for those who don't care to still do what they need to do.
Just a few minutes ago I got an email forwarded to me from a "stupid" user who couldn't figure out how to perform what to me seems like a simple task in some software my team wrote. We emailed him the directions, even though they were very clearly stated in the manual that I wrote, but I took it one step further. I submitted a feature request in our bug-tracking database to put a message next to the option he was trying to use, explaining why that option is grayed out.
Anyone can write software or set up hardware that has tons of geek features that we all like, but it takes a lot more effort to make the setup actually usable to the target users.
Set Standards (Score:4, Insightful)
I've worked in two kinds of places, ones where they set (and stick to) standards and ones that don't. Every place that doesn't use or doesn't stick to standards has always been an experience in wasted time, confusion, and lots of bugs. Those that do can seem like you're always being nagged but in the end you find things work as expected, code is far easier to manage (especially when it is someone else's), and you aren't always having to reinvent the wheel (i.e. figuring out how to fix a subtle bug again because the solution was never written down the first time).
It sounds simple but it takes discipline at all levels. Even something as simple as documenting what you did afterwards and putting it in an orderly file system can make a huge difference, but how many people bother to do it? Managers and fellow developers have to crack the whip and keep people from trying to cut corners.
Standards should be open to some change and can be bent, but there has to be a very good, defensible reason for it.
Re:Simple (Score:4, Insightful)
We had a problem: SQL was performing poorly. A typical query on the machine that took 50 minutes was taking 2.5 hours and was sometimes failing. We instantly started looking at data and possible database corruption. The VP of Operations came down and started "directing us"; we politely ignored him and continued down our path. He then ordered us to rip the heart out of the SQL server: remove 4 processors, remove 8 gig of RAM, downgrade from Enterprise to Standard and only 2 processors. Over and over he kept telling us to do things that were insane, because he used to be an Ops manager in the company and knew what he was talking about.
4 days and about 80 hours of wasted overtime later, we carefully rebuilt the server BACK to a last known good from a backup taken before the mess and then discovered that, oh, there was a DATABASE DATA PROBLEM!
If someone starts off on a wild chase, changing things wildly, I do not care who they are: tell them to piss off and please stand behind the glass. Or better yet, do that nicely by getting everyone, including the vendor, to agree that what they want to do is not the right thing.... Ganging up on them typically works.
So the parent is 1000% correct. Not only is the solution typically simpler than you think, but it is usually the one that makes the most sense.
If your SQL server suddenly starts acting up after 2 years of good operation, there is almost no chance that ripping its guts out will help anything.
Re:In no particular order.... (Score:3, Insightful)
Treat users with respect (Score:5, Insightful)
Re:In no particular order.... (Score:4, Insightful)
No, that's a bad project manager... or possibly a bad salesperson.
Good project managers are the other way around: If they hear "5 days" from the developer, they promise it to the customer in 6. This allows a little time for QA testing if the developer gets it done within his 5 days... and allows for a small buffer if the developer doesn't get it done on time.
You don't know everything (Score:5, Insightful)
Re:Top 3 (Score:1, Insightful)
like when you want to:
rm -r tmp_*
avoid the Space Of The Death:
rm -r tmp_ *
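To see why that one stray space hurts so much, here is a minimal Python sketch of what the shell hands to rm in each case. It only simulates the word splitting and glob expansion; nothing is deleted. The directory listing and the targets() helper are made up for illustration, and real shells differ in details such as dotfile handling.

```python
import fnmatch
import shlex

# Hypothetical directory listing, for illustration only.
FILES = ["tmp_build", "tmp_cache", "notes.txt", "src"]

def targets(command, files=FILES):
    """Return the names rm would receive after word splitting and globbing."""
    args = shlex.split(command)[1:]                    # drop the "rm" itself
    patterns = [a for a in args if not a.startswith("-")]
    matched = []
    for pattern in patterns:
        hits = fnmatch.filter(files, pattern)
        # Like most shells, pass an unmatched pattern through unchanged.
        matched.extend(hits if hits else [pattern])
    return matched

print(targets("rm -r tmp_*"))   # ['tmp_build', 'tmp_cache']
print(targets("rm -r tmp_ *"))  # ['tmp_', 'tmp_build', 'tmp_cache', 'notes.txt', 'src']
```

With the space, "tmp_" and "*" are two separate arguments, so the second one matches everything in the directory.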
Just to PO the "Don't post your list here" folks.. (Score:5, Insightful)
GeekSquad Top Ten? (Score:5, Insightful)
1. It must be this unsupported software: remove Firefox or any F/OSS.
2. It is a virus, your AV is no good, purchase Norton CoverYourAss v9.6 for $49.95.
3. The AV doesn't perform a deep clean by itself, we can run one for $24.95.
4. You need a bigger hard drive, we recommend Norton Ghost to copy it. $199.95 + $49.95.
5. We should install the drive. $24.95 + $8.95 wrist strap.
6. We should run Ghost for you, $19.95.
7. You need USB 2.0 ports for your mouse to run faster, $49.95 plus $24.95 installation.
8. Your hard drive cables are old belt style, you need the snappy round cables, $29.95 plus $9.95 installation.
9. Your video board is old, the ATI MegaWow XL is only $199.95.
10. You should probably buy one of our Compaq BusinessPro by HP combinations, you burned your TCP/IP converter with static.
I pop open the discarded PC, replace the processor fan and blow out the case. All is fine - $30.
Re:In no particular order.... (Score:5, Insightful)
Well that is usually a half truth. Usually when you reboot a Unix system you do it for the following reasons.
1. You screwed up and have no alternative interface to get in.
2. Your system has been on so long that you want to reboot it to see what went down without it telling you.
3. You need to add hardware and it isn't hot-swappable.
4. The time it will take to fix it without rebooting outweighs the disadvantage of downtime.
5. You lost power for an extended period of time.
6. Management tells you so.
7. Upgrading the OS to a level where all services need to be restarted.
8. There are many unknown processes and you want to be sure you are not stopping an important job.
9. Other...
But normally the reboot causes problems because the drives have been spinning for years. Having them stop and then start again puts strain on them and causes them to die. Or, if the system has enough memory, the drive may have died years ago but all the data is still paged into memory.
Re:95% of all problems.... (Score:5, Insightful)
Re:4 Rules (Score:5, Insightful)
God, yes.
"Nothing happens when I check my email."
"Do you get an error message when you try it?"
"There was some dialog on the screen, yeah."
"Grr. What did it say?"
"Oh, I didn't read it"
Aaaarrgggh.
Schmooze the users (Score:5, Insightful)
They started waiting for me to stroll in instead of paging me at night, just to be nice to me.
But the best part was, they thought of me as the guy who keeps the system running, because most of the time that I showed up, the system was running.
My colleagues who only showed up when their systems broke had the reputation "Here comes trouble!"
Taking credit for things going well is essential!
The Truths of Skippy (Score:5, Insightful)
A few for Oracle SysAdmins (Score:3, Insightful)
9) Patching only works 30% of the time
8) Metalink is like a massive "Magic 8 Ball" that pulls responses from the database. Treat it as such.
7) TARs are the same as 8, except you have a customer service rep reading the 8 Ball.
6) If it generates core files it's the DBA's problem.
5) It's ALWAYS the DBA's fault.
4) RMAN is your friend.
3) You know more about Apache than Oracle does.
2) Oracle won't admit this.
1) AutoConfig doesn't.
Re:95% of all problems.... (Score:5, Insightful)
Application > Presentation > Session > Transport > Network > Data Link > Physical. This order is actually from layer 7 to 1.
If you had followed the OSI model, you would've found out that the *first* thing to do would be to check the physical connection (aka power cord) and found your problem right away.
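The bottom-up walk can be sketched in a few lines of Python. This is only an illustration, assuming you have some per-layer check to run; the lambdas below are hypothetical placeholders, not real diagnostics.

```python
# OSI layers listed from layer 1 (bottom) to layer 7 (top).
OSI_LAYERS = [
    "Physical", "Data Link", "Network", "Transport",
    "Session", "Presentation", "Application",
]

def first_failing_layer(checks):
    """Walk the stack bottom-up and return (number, name) of the first failed check.

    `checks` maps a layer name to a zero-argument function returning True
    when that layer looks healthy. Layers without a check are skipped.
    """
    for number, name in enumerate(OSI_LAYERS, start=1):
        check = checks.get(name)
        if check is not None and not check():
            return number, name
    return None

# Hypothetical example: the power cord is out, so everything above fails too.
checks = {
    "Physical": lambda: False,     # is it plugged in and powered?
    "Network": lambda: False,      # can we reach the gateway?
    "Application": lambda: False,  # does the service answer?
}
print(first_failing_layer(checks))
```

Starting at layer 1 means the unplugged cord is reported before anyone spends time on the application.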
Re:Top 3 (Score:3, Insightful)
I also make a point of instinctively typing WHERE immediately after a DELETE statement in SQL, then using the arrow keys to add the information between the two. Nothing like someone distracting you, and hitting return when your SQL statement says "delete from reallyImportantTable".
(For those in the audience not in the know, that will tell SQL to automatically delete all records from the table)
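Another belt-and-braces option is to run the DELETE inside a transaction and roll back if it touches more rows than you expected. A minimal sketch using Python's built-in sqlite3 module; the table name comes from the post above, while the columns and the guarded_delete() helper are invented for illustration, and other databases and drivers handle transactions differently.

```python
import sqlite3

def guarded_delete(conn, sql, params=(), max_rows=1):
    """Run a DELETE, but roll back if it hits more rows than expected."""
    cur = conn.execute(sql, params)    # sqlite3 opens a transaction for DML
    if cur.rowcount > max_rows:
        conn.rollback()                # undo the oversized delete
        return 0
    conn.commit()
    return cur.rowcount

# Throwaway in-memory database with three rows (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reallyImportantTable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO reallyImportantTable (name) VALUES (?)",
    [("a",), ("b",), ("c",)],
)
conn.commit()

# Forgot the WHERE: the guard sees too many rows and rolls back.
removed_all = guarded_delete(conn, "DELETE FROM reallyImportantTable")
# With the WHERE: exactly one row goes away.
removed_one = guarded_delete(
    conn, "DELETE FROM reallyImportantTable WHERE id = ?", (2,)
)
remaining = conn.execute("SELECT COUNT(*) FROM reallyImportantTable").fetchone()[0]
print(removed_all, removed_one, remaining)
```

The guard does not replace typing WHERE first; it just gives you one more chance when someone distracts you.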
95% of bad GUI design.. (Score:5, Insightful)
I agree, good software should be intuitive, but far better to be proactively engineered to be more intuitive, rather than reactively veneered to feel less unintuitive.
Re:In no particular order.... (Score:4, Insightful)
Re:95% of all problems.... (Score:3, Insightful)
No, most user error comes from the fact that they are forced to learn a new package almost every year. If you think about an automobile's interface, it is pretty damn unintuitive. But because it has been more or less in the same form for decades, we hairless apes have adapted to it, and make rude remarks about those who can't figure it out.
Take the key and insert it into the ignition switch. On manual ignition cars, hold down the clutch (furthest left) pedal. Turn the ignition switch 180 degrees until you can hear the starter motor turn the engine over. Immediately let go of the key when combustion begins. After the engine has had some time to warm up, tap the accelerator to release the choke...
People can learn some really complex things, given enough time and experience. We just don't allow people either when rolling out computer systems.
(Spoken as the guy who was programming VCRs at 4, and who has managed to work his way through every computer interface he's ever sat in front of.)
Users are idiots, but (Score:3, Insightful)
Users are idiots. This is a good thing.
We expect them to be computer illiterate, and they rarely disappoint.
If I'm working at a biotech company, I don't want the researchers to be good at computers. If I'm working at an investment firm, I want the users to understand investments, not DLLs.
We're here precisely so that they can be idiots at computers... and experts at whatever it is they do when their computers aren't broken.
The company isn't here so that we have a network to play with.
Learn to praise the users' idiocy, they'll appreciate it.
If the users get frustrated, empathize with their confusion and blame Microsoft. Never fails.
Re:#10 Reboot should be #1 (Score:2, Insightful)
Indeed, "when all else fails". I see too many technical people reboot before understanding the problem. Though it may work and though it may be faster, they haven't learned anything about what was happening. Furthermore, if there was a malicious cause for the problem, rebooting has a better chance of erasing the evidence.
If doctors killed patients as a means of troubleshooting...
Re:In no particular order.... (Score:5, Insightful)
"Rebooting Solves 90% of Windows problems"
Nope. Rebooting only clears 90% of symptoms, it doesn't necessarily make the problems go away. For example, if you have a webserver that's got a memory leak and that leak takes 72 hours to fill RAM to the point that the system becomes unusable, rebooting clears the symptom (unusable system) but doesn't resolve the problem (bug in the webserver). Too many people think that the reboot fixes the problem, so they don't ever bother finding out what the real problem is.
Re:95% of bad GUI design.. (Score:3, Insightful)
In such cases, many times the best thing to do is examine what the overall purpose of the software is supposed to be and start over from scratch, but engineer the new solution, rather than cobble it together.
Clearly someone has confused terms. (Score:2, Insightful)
#1. The User has no idea what you do, but they will blame their problems on you.
Some guy on the support desk bumped a call to you saying you had a server problem, the user doesn't have internet access, forward it back to the desk.
#2. Other employees have no idea what you do, and will try to pass the buck to you.
Exchange is down and you're a Network Admin? Well suddenly there's a network problem, certainly not a problem with the Exchange server. There's a network problem? Well then you're a Windows Admin and it's clearly your fault. No matter what they will always find a way to blame their problems on you.
#3. If you're doing your job well, they will fire you.
Congratulations, your systems stay up all the time, maintain themselves, and building a new system for your environment is such a painless and well-documented procedure it takes minutes and a monkey could do it. So we're hiring a monkey. Get out.
#4. If you're doing your job poorly, they will promote you.
#5. A crashed system holds many secrets, do not reboot it.
You need to find the source of the problem, or it will come back to haunt you. If you're fine with rebooting six times a day, be my guest.
#6. Backups are important, but multi-layered redundancy is the way to go.
Why have one webserver when you can have two? Why have one mailserver when you can have two? Have two locations? Why not have four mailservers? The more redundancy you have in your systems and your network, the less you have to care about midnight outages.
#7. If someone needs to tell you to be more polite, you have no business working in IT.
Seriously, you screwing up screws up everyone's day. You need to be meek, friendly, and try your best not to let everyone know you make more money than them. Except Sales and Marketing. They scoff at your puny salary.
#8. Always ask for a lot more than you need.
A big project requires four servers? Ask for eight. Setting this up is going to take a week? Tell them it will take two. It never hurts to try to get a safe buffer, and you'll never get what you ask for anyway unless you're amazingly lucky. Either way, if you say it takes two weeks and you get it done in one, you look awesome. If you tell them it takes eight servers and they give you four, you look like a rock-star when you get it to work. If they give you the eight servers anyway...well...it looks like you have some spares for once.
#9. If it can't be done, say so.
Don't get yourself involved in an impossible project. Doing the impossible might be part of your job, but if you don't know OpenView and your manager wants OVO to be making his coffee in the morning, don't tell him you can make it happen.
#10. Always look for something to improve.
This is what most admins forget. An idle admin is a fired admin, and an idle admin eventually is a stale and clueless admin. Remember, your manager will never hate you for suggesting new projects, and for suggesting things can be done better, especially if it's free. I find when I'm sitting idle between projects there's nothing better than to give myself a project and FINISH it. There's nothing quite as nice as informing your manager that there's a better way to do something, and that you've already set it up.
One truth I have learned in my exile to the Abyss (Score:1, Insightful)
Every time I rolled out a config with some small error or missing component or feature, someone found it and complained. I thought it was just coincidence or taking too long to patch. Noppers, they have fingers and will click anything with a shortcut or [OK] button.
The latest gotcha for our dept. was the user who discovered that ".jpe" files were not opening. Two days of my life I will never get back.
Heybiff
One thing I have learned (Score:2, Insightful)
Have a Knoppix (or other) LiveCD nearby (Score:5, Insightful)
yes, but not the aura (Score:3, Insightful)
I reckon this has most to do with approach... users, especially the non-techy variety, tend to approach things in the same casual way they approach TV, or writing a report... casually, and intuitively... we aren't like that, generally; geeks are methodical... every step we take is scripted, and we're analyzing what we're doing as we're doing it...
remember trying to get those first couple of computers to talk to each other when you were a kid? one of the things we learned from that was approach: mentally cataloging each step along the way so it could be duplicated later... we deal with most things (and *especially* troubleshooting things) with the scientific method *firmly* implanted at the front of our consciousness
when the dvd doesn't work right for the non-tech, it's probably error related, but they wouldn't know that, because they just did what *feels* right... our "aura" is our ability to approach things methodically
Re:yes, but not the aura (Score:3, Insightful)
The difference is that most users are trained in pure buttonology; they have been taught to press f1, then f3, then f8, write down the displayed result, then press f5 and start over. This works fine until the first, slightest little thing goes wrong, e.g. they 'fat-finger' f2 instead of f1. Suddenly they're in a confusing world they don't understand and can't deal with at all.
I don't have time to teach them every possible screen they can reach, but I do make the effort to ensure they understand what 'f1' actually does, and why they're following the sequence they do. The result goes way beyond what I actually teach them; it gives them confidence in their ability to master the system, and when they do have to call me it's usually with a real problem, not I-pressed-f3-and-I'm-scared, and usually they've already collected at least some of the information I need to fix it over the phone.
In my experience most users aren't dumb, they're ignorant and frightened. Taking a little time to erase the ignorance eases the fear, and saves me a lot of headaches. Of course, I don't get to tell as many PEBKAC stories as some of my cow orkers, but the ones I do collect tend to be doozies.
Re:Top 3 (Score:4, Insightful)
Never ask dumb questions like that. It embarrasses the user for no good reason. Find a subtle way of getting them to check the power without forcing them to reveal their mistake. Such as:
They'll still learn the lesson - check the power before calling tech support - but now they won't feel so uncomfortable that you were mocking them with your questions.
Re:In no particular order.... (Score:2, Insightful)
I can't recall the number of times "the problem is solved!" by rebooting only to happen again a week later.
And always at a worse time.
User equipment (Score:2, Insightful)
2. Never use the user's keyboard -- see #1 and multiply by 10.
3. The user's keyboard usually contains items which did not stay in the user's mouth. Bring your own keyboard.
4. If hygienic input devices are not available, create a reason to work on the tower in your space where hygienic input devices are available.
5. If the system cannot be removed, engage the user. Make the user root/administrator thru remote interfaces; then direct the user thru the steps to correct the problem.
6. If you must use the user's input devices, maintain a supply of surgical scrub solutions in your personal toolkit.
Education Mitigates Ignorance (Score:3, Insightful)
Users are not stupid... they are ignorant. They don't understand why it is failing. They may even be very knowledgeable, just not in the domain of the current problem.
While you're waiting for that reboot, why not explain to the user what you suspect the problem is, and why? When they get confused between their email address and their username, clarify and define the terms. When they put www in front of every URL, whether it should be there or not, explain how hostnames are a custom, not a rule.
"Type email.example.com in the address bar at the top."
"It says host not found?"
"Read me the address bar, letter by letter..."
"http://www..."
"Hold right there... the address I gave was email.example.com. Not all websites begin with www, just most."
"Ahh, gotit, lemme retype it..."
"Hostnames are just names... we could have called it fluffy.example.com if we wanted, but that would be silly. *chuckle* Ok, now that you have that typed in..."
And hopefully that user will remember from then on that websites don't have to begin with www. They may even look at and notice the alternate hosts various sites use. They learned something, it took only a few seconds longer, and the user will hopefully know a little more about the background behind the stuff they are told to do.
If you take a few seconds out of every call to combat ignorance, pretty soon you'll start getting fewer calls. At the very least, the calls will be more tolerable because the user won't be making the same completely stupid mistakes over and over because they don't understand.
If every tech took a few seconds to combat ignorance, we could actually make a difference.
The Raven