https://github.com/Rdatatable/data.table/wiki/Amazon-EC2-for-beginners Jan Gorecki ·
This explains in a minimal way how to start and use a spot instance on Amazon EC2. There are several good blog articles but I found they didn't cover some aspects in great detail. There are so many services and options provided by Amazon that I found it quite daunting at first when all I wanted was a large memory machine for a few hours every now and again. This is a wiki page so you can easily update and improve it as time passes - press Edit in the top right.
Spot instances are cheap because, should you be outbid by someone else or spare capacity be reallocated, they can be killed at any moment by Amazon with no notice. However, in my experience (so far) that rarely happens. A spot instance is ideal for large data benchmarking and research jobs; i.e. tasks that can simply be restarted should they be killed.
Get to the EC2 Management Console and bookmark it in your browser.
Click Spot Requests in the left hand menu. Your screen should now look like this :
Resist the temptation to click the blue "Request Spot Instances" button; click the grey Pricing History button at the top instead.
Change the instance type in the drop down at the top to the one you want; e.g. r3.8xlarge. You have to use another source to find out how much RAM and how many cores each instance name corresponds to; e.g. http://www.ec2instances.info/. Observe the history and the current price. If this isn't acceptable, close the price history and change the region in the drop down box in the black area at the top right of the Management Console. Then click price history again. Keep changing region and type until you find a combination where the price is acceptable. Each region/type combination is priced separately.
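The same price check can be scripted rather than clicked through. This is only a sketch: it assumes the AWS command line tool (`aws`) is installed and configured with credentials, which this page does not cover, and the function name `spot_price_history` and the regions shown are illustrative.

```shell
# Hypothetical helper: print recent Linux spot prices for one instance type
# in one region. Assumes the AWS CLI is installed and configured.
spot_price_history() {
  local instance_type="$1"   # e.g. r3.8xlarge
  local region="$2"          # e.g. us-west-1
  aws ec2 describe-spot-price-history \
    --region "$region" \
    --instance-types "$instance_type" \
    --product-descriptions "Linux/UNIX" \
    --max-items 20 \
    --query 'SpotPriceHistory[].[Timestamp,AvailabilityZone,SpotPrice]' \
    --output text
}

# Usage: compare two regions for the same instance type.
# spot_price_history r3.8xlarge us-west-1
# spot_price_history r3.8xlarge us-west-2
```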
Now click the blue "Request Spot Instances" button. Note that this isn't the same as the "Launch instance" button in the Instances view (although that is where we'll view the spot instance in a moment).
Step 1: (Choose an Amazon Machine Image) The Quick Start machine images are selected by default. Choose the Ubuntu one. Currently it's the 4th one down: Ubuntu Server 14.04 LTS (HVM), SSD Volume Type, 64bit. This is a brand new, blank and factory fresh Linux server. Simple. No dependencies. No software or libraries pre-installed that might be out of date.
Step 2: (Choose an Instance Type) Choose r3.8xlarge (244GB RAM and 32 cores).
Step 3: (Configure Instance Details) The maximum bid price is the only field you need to complete. This is the maximum you're prepared to pay per hour. Start with the current spot price from the pricing history above and, with knowledge of the history, add some margin; e.g., if the spot price is $0.25 then I tend to bid $0.50. Should you be outbid you have no opportunity to increase your bid: your instance will just be killed instantly.
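As a worked example of the margin rule above, here is a small sketch. Doubling the spot price is just the habit described above, not an Amazon recommendation, and the `max_bid` name is made up for illustration.

```shell
# Hypothetical helper: bid = current spot price x margin factor (default 2).
max_bid() {
  local spot_price="$1"
  local factor="${2:-2}"
  awk -v p="$spot_price" -v f="$factor" 'BEGIN { printf "%.2f\n", p * f }'
}

max_bid 0.25   # prints 0.50
```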
Step 4: (Add Storage) Next
Step 5: (Tag Instance) Next
Step 6: (Configure Security Group) SSH (port 22) is already open by default. It's important to add HTTP (80) and HTTPS (443) otherwise R can't download packages. Optional: In the security group name field, change "launch-wizard-1" to "R Server", then next time you can just choose "Existing security group" instead.
Click Review and launch
(Select an existing key pair or create a new key pair) Select "Create a new key pair". The "Key Pair name" field is just the name of the file that will be created on your local machine. A different file is needed for each Amazon region it seems. So I have "~/mdowle.pem" for N.California, "~/mdowleOregon.pem" etc. Enter the file name (without the .pem extension) into the field and click the "Download Key Pair" button and save it somewhere within easy reach (I save them in my home directory ~). Next time you can just "choose an existing key pair" and it will find the appropriate .pem file for that Amazon region.
Check the tick box and click "Request Spot Instance" blue button.
Your request will now appear as a new line in the "Spot Requests" view. After at most a minute the status will change from yellow to a green state "active" and status "fulfilled". However, the view does not refresh automatically so you need to click the refresh button in the top right every 10 seconds or so. You can now change to the "Instances" view (INSTANCES=>Instances on the left menu) and you have a new line there as well. Your instance is now running and you are being charged per hour whether it is idle or not. Ensure you don't forget to kill any running instances when you're finished otherwise you'll get a surprise when the monthly bill arrives in your inbox. There are no time limits or warnings about running instances you may have forgotten to terminate.
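To put a number on that surprise, here is a rough sketch of the arithmetic. It assumes a constant hourly price (real spot prices fluctuate) and uses $0.25 per hour purely as an example; the `running_cost` name is illustrative.

```shell
# Hypothetical helper: cost of leaving an instance running.
# Assumes a constant hourly price; real spot prices vary over time.
running_cost() {
  local price_per_hour="$1"
  local hours="$2"
  awk -v p="$price_per_hour" -v h="$hours" 'BEGIN { printf "%.2f\n", p * h }'
}

running_cost 0.25 3     # a three-hour job: prints 0.75
running_cost 0.25 720   # forgotten for 30 days (720 hours): prints 180.00
```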
Select the instance (if not already selected) by ensuring the blue check box is filled in so that the grey Connect button at the top is active and click it. This doesn't really connect, it just displays a window showing you how to connect.
Copy the example line from this window, for example :
ssh -i mdowle.pem ubuntu@<public-ip>
NB: The .pem filename and the IP address will be different for you.
Paste this into a shell (I paste it into my editor's shell). Either do this in the directory where you saved the .pem to or include the path to the .pem file. That's why I put the .pem files in ~ to make this easy since the shell opens in the home directory. Enter "yes" to "Are you sure you want to continue connecting (yes/no)?"
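If ssh refuses the key with a warning about an unprotected private key file, the .pem file's permissions are too open. A minimal sketch of the fix, assuming the key was saved as ~/mdowle.pem as above (the `protect_key` name is illustrative):

```shell
# ssh ignores private keys that other users can read, so restrict the
# .pem file to owner read-only. The path defaults to the example key above.
protect_key() {
  local key="${1:-$HOME/mdowle.pem}"
  if [ -f "$key" ]; then
    chmod 400 "$key"
  else
    echo "No key file at $key" >&2
    return 1
  fi
}

# Usage: protect_key ~/mdowle.pem
```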
You now have a prompt to a factory fresh large-memory machine. Type free -h. Type lscpu. Smile.
I have the following startup script in my editor which I run by pressing F5.
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo add-apt-repository 'deb http://cran.stat.ucla.edu/bin/linux/ubuntu trusty/'
sudo apt-get update
sudo apt-get -y install r-base-core
sudo apt-get -y install libcurl4-openssl-dev # for RCurl which devtools depends on
sudo apt-get -y install htop # to monitor RAM and CPU
R   # start R; the remaining lines are typed at the R prompt
options(repos = "http://cran.stat.ucla.edu")
# Use R as normal ...
Once you're used to it, you can get to this point in under 5 minutes.
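The `options(repos = ...)` line has to be entered at the R prompt in every session. One way to avoid that is to append it once, from the shell, to the R profile that R reads at startup. This is a sketch, assuming the UCLA mirror above is still the one you want; the `set_cran_mirror` name is illustrative.

```shell
# Hypothetical helper: append the CRAN mirror setting to an R profile so
# every new R session picks it up. With no argument it uses the file named
# by R_PROFILE_USER if set, else ~/.Rprofile.
set_cran_mirror() {
  local rprofile="${1:-${R_PROFILE_USER:-$HOME/.Rprofile}}"
  local line='options(repos = "http://cran.stat.ucla.edu")'
  # Only append if the exact line is not already there.
  grep -qxF "$line" "$rprofile" 2>/dev/null || echo "$line" >> "$rprofile"
}

# Usage (on the EC2 instance, before starting R): set_cran_mirror
```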
Start another shell, paste in the same ssh command to connect and type htop. Leave this running to monitor RAM and CPU usage on the remote instance.
Type df -h and observe that the disk is not large. However, you have 244GB of RAM. Use a RAM disk by writing to and reading from /dev/shm; that also gives very fast disk access. Even if you use 100GB of RAM disk, you'll still have about 144GB of RAM. Transfer any results you want to keep from the server to your local machine.
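A small sketch of using the RAM disk as scratch space. It assumes /dev/shm exists and is writable and falls back to the ordinary temp directory otherwise; the `scratch` directory name and the file contents are made up.

```shell
# Pick the RAM-backed directory if available, else fall back to /tmp
# (the fallback is on disk, so it will not be as fast).
ramdir=/dev/shm
if [ ! -d "$ramdir" ] || [ ! -w "$ramdir" ]; then
  ramdir="${TMPDIR:-/tmp}"
fi

scratch="$ramdir/scratch"
mkdir -p "$scratch"

# Write a small file and read it back; large intermediate files work the same way.
printf 'id,value\n1,10\n2,20\n' > "$scratch/example.csv"
wc -l < "$scratch/example.csv"   # counts the 3 lines just written

# See how much of the RAM disk is in use.
df -h "$ramdir"
```

Files under /dev/shm count against RAM and disappear when the instance is terminated, so copy anything you want to keep.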
To transfer files to and from the server :
# To copy to EC2 (final colon needed):
scp -i ~/mdowle.pem localFile.csv ubuntu@<public-ip>:
# To copy from EC2 (final space then dot is needed):
scp -i ~/mdowle.pem ubuntu@<public-ip>:~/remoteFile.csv .
Terminate your spot instance
When you're finished, be sure to terminate your spot instance in the correct way using the Management Console. Use the Instances view (the same view where you clicked the Connect button), select the instance, click the grey Actions button at the top, select the Instance State submenu and then Terminate.
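As a double check that nothing is still running (and still being billed), the running instances in a region can also be listed from the command line. This is a sketch only: it assumes the AWS CLI is installed and configured, which this page does not cover, the function name is illustrative, and the check has to be repeated for every region you have used.

```shell
# Hypothetical helper: list ID, type and launch time of instances still
# running in one region. Empty output means nothing is running there.
list_running_instances() {
  local region="$1"   # e.g. us-west-1
  aws ec2 describe-instances \
    --region "$region" \
    --filters "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,InstanceType,LaunchTime]' \
    --output text
}

# Usage: list_running_instances us-west-1
```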