Tag Archives: data science

Cheat Sheets for Stata

Cheat sheets for programming languages were commonplace before the internet and its powerful search engines became widely available. I believe that cheat sheets are still very useful today, because search engines have become almost too powerful and return far more than the question requires. Some of the most popular cheat sheets for Stata were prepared by Tim Essam and Laura Hughes from the US Agency for International Development (USAID). Follow them on Twitter: @StataRGIS and @flaneuseks. Here are some of their cheat sheets:


Commands for Data Analysis

Programming

Commands and functions for data processing

Commands and functions for data transformation

Learning the basics of Unix

Teaching computer skills to students has always been a challenge in Economics. The first difficulty is the limited amount of time they have to learn the basics of any software. A second difficulty arises when the software is only available for Unix. Recently, I found a very simple yet comprehensive tutorial on the basics of Unix. Most of my students who used it were able to quickly gain some proficiency and write better code, especially when using bash.

Statistical Learning using R

I recently came across the book “An Introduction to Statistical Learning, with Applications in R”.

It can be downloaded for free from the authors' webpage, which also contains the R code, data sets, errata, slides and videos for the Statistical Learning MOOC, and other valuable information.

I think this is a very useful book for those interested in Statistical Learning. It is accessible to most people, since it does not require a strong mathematical background.

For those interested in gaining a deeper understanding of these topics, I strongly suggest the book “The Elements of Statistical Learning”, which is also available for download at no cost.

Stata tip: creating a local containing all (or almost all) variables of the data set

Locals containing a list of variables can be very useful when using Stata. A common need is a local containing all variables of a data set. This local can be created by means of the ds command.

Here is an example using the lifeexp.dta data file.

. webuse lifeexp, clear
(Life expectancy, 1998)

Now, let’s create a local named allvar that will contain all variables of this data set.

. ds
region country popgrowth lexp gnppc safewater

. local allvar `r(varlist)'

. di "`allvar'"
region country popgrowth lexp gnppc safewater

We can see that ds stored the variable list in r(varlist). One interesting variation is the creation of a local containing all variables except region. To do this, list the variables to be excluded right after ds and add the option not after a comma.

. ds region, not
country popgrowth lexp gnppc safewater

. local othervar `r(varlist)'

. di "`othervar'"
country popgrowth lexp gnppc safewater
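
Once the local exists, it can be passed to any command that accepts a variable list. The lines below are a minimal sketch of that idea, written as a do-file rather than typed at the prompt. The local names vtype and numvars are made up for illustration, and the has(type numeric) option is used here on the assumption that you are running the official ds command shipped with recent versions of Stata.

```stata
* Sketch: reuse a local built from ds (same lifeexp data as above)
webuse lifeexp, clear

* All variables except region
ds region, not
local othervar `r(varlist)'

* Loop over the variables in the local and report each storage type
foreach v of local othervar {
    local vtype : type `v'
    display "`v' is stored as `vtype'"
}

* ds can also select by type; here, numeric variables only
ds, has(type numeric)
local numvars `r(varlist)'
summarize `numvars'
```

Building the list with ds instead of typing it by hand means the do-file keeps working if variables are later added to or dropped from the data set.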

The ds command has several other useful applications that will be discussed in later posts on this blog.

Transferring IPEADATA series to Stata

A common issue that arises when converting time series data from IPEADATA to Stata format is dealing appropriately with the time variable. For instance, for monthly series the date format will be YYYY.MM. Stata usually interprets this format as numeric.

Suppose you have already downloaded a monthly series from IPEADATA and transferred it to Stata. It is very likely that the date variable (let’s call it date) has been automatically handled as a numeric variable. The first thing to pay attention to is that the numeric format drops trailing zeros after the decimal point. This means that October of 1940 is coded as 1940.10 by IPEADATA and interpreted as 1940.1 by Stata. To recover the missing zero, the first step is to convert this variable to string format. This can be done with the string() function.

generate sdate=string(date)

To add back the missing zeroes, we can do the following:

replace sdate=sdate+"0" if length(sdate)<7

Now, we just need to tell Stata to interpret sdate as a monthly date variable. This can be accomplished with the command numdate. This is not a standard Stata command and needs to be installed on your computer (ssc install numdate).

numdate mo newdate = sdate, pattern(YM)

The above line can be read as: create a new monthly date variable named newdate from the variable sdate, whose values follow a year-then-month pattern (YYYY.MM).

The numdate ado file can deal with very flexible date specifications, and its help file is very comprehensive. Two other useful commands are convdate and extrdate. They are used to convert or extract parts of dates from variables that are already in the Stata date format.

A final recommendation is to take a look at the Stata documentation on dates, available at http://www.stata.com/manuals13/ddatetime.pdf.
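
For readers who would rather not install an extra package, the same conversion can be sketched with built-in functions only. This is an alternative to the numdate route described above, not a replacement for it. It assumes the date variable is still numeric in the YYYY.MM layout. The variable names year, month and mdate, and the three sample values, are made up for illustration.

```stata
* Sketch: convert numeric YYYY.MM dates to Stata monthly dates
* using built-in functions only (no numdate required)
clear
input double date
1940.01
1940.1
1940.11
end

* Integer part is the year
generate year = floor(date)

* Decimal part times 100 is the month; round() absorbs
* floating-point noise, so 1940.1 yields month 10
generate month = round((date - year) * 100)

* ym() builds a Stata monthly date; %tm makes it display as a month
generate mdate = ym(year, month)
format mdate %tm
list
```

Because the month is recovered arithmetically, the missing trailing zero never needs to be restored, so the string conversion step is skipped entirely.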

A good tutorial for learning the basics of Python for data analysis

I found this interesting tutorial for Python. In my opinion, Python is a very simple and intuitive language, and at the same time it is very powerful. This link leads to a straightforward Python tutorial that focuses on the basic knowledge needed to use Python for data analysis.