Given a Complete Genomics (CGI) variant file, this very basic script extracts all single-base, diploid SNPs and converts them into a PLINK TPED (transposed PED) file. Note that the script does not generate the corresponding TFAM file; you will have to write one yourself. See http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr for more information on how to do this. You can then use PLINK to convert the TPED and TFAM files into the more commonly used PED/MAP or BED/BIM/FAM file formats.
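Since the TFAM file has to be written by hand, here is a minimal sketch of one way to do it for a single genome. It is not part of the script. The function name write_tfam, the family ID "FAM1" and the choice of sample ID are assumptions made for illustration. The six columns follow the PLINK TFAM layout: family ID, individual ID, paternal ID, maternal ID, sex, phenotype.

```python
def write_tfam(path, individual_id, family_id="FAM1", sex=0, phenotype=-9):
    """Write a one-line TFAM file describing a single genome.

    The default values are placeholders (assumptions), not values taken
    from the variant file: 0 marks an unknown parent or unknown sex, and
    -9 marks a missing phenotype.
    """
    # Column order: family ID, individual ID, paternal ID, maternal ID,
    # sex, phenotype.
    fields = [family_id, individual_id, "0", "0", str(sex), str(phenotype)]
    with open(path, "w") as handle:
        handle.write(" ".join(fields) + "\n")


# Example call (hypothetical file name and sample ID):
#   write_tfam("var-GS000001234-ASM.tfam", "GS000001234")
```

The TFAM file should share its prefix with the TPED file so that PLINK can load the pair with --tfile; adding --recode or --make-bed then produces PED/MAP or BED/BIM/FAM output respectively.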
For the input, you can provide either the original bzip2-compressed (.bz2) variant file or the uncompressed file; the script uses the file extension to determine which one it has been given.
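Choosing a reader from the file extension can be sketched as follows. This illustrates the idea only and is not the script's actual code. The helper name open_variant_file is an assumption, and the sketch uses bz2.open, which needs Python 3.3 or later rather than the Python 2.4 the script itself targets.

```python
import bz2


def open_variant_file(path):
    """Return a text-mode file handle for a variant file.

    Files ending in .bz2 are decompressed on the fly; anything else is
    opened as plain text. (Hypothetical helper, not the script's own code.)
    """
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")
```

Either way the caller gets an object that can be iterated line by line, so the rest of the parsing code does not need to know whether the input was compressed.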
Note that the xRef field for some variants lists more than one dbSNP ID. For example, the variant at position 69552 on chromosome 1 lists both rs2531266 and rs55874132, in that order.
By default, the ID assigned to the SNP in the output file will be the last one listed in the xRef field, so the corresponding line in the output would be:
1 rs55874132 0 69552 C C
If instead you want to use the first rs ID listed, use the -f switch on the command line passed to the script. You would instead get:
1 rs2531266 0 69552 C C
If the xRef field is empty, the SNP will be given a name based on the chromosome and position, like so:
1 SNP-1-69552 0 69552 C C
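The three naming rules above (last ID by default, first ID with -f, position-based name when xRef is empty) can be sketched as below. This is an illustration, not the script's own code. The function names are hypothetical, and the sketch assumes that xRef entries are separated by semicolons and that each entry ends in a colon followed by the rs ID.

```python
def choose_snp_id(xref, chromosome, position, use_first=False):
    """Pick the SNP name for one variant.

    Assumes (not confirmed by the source) that xref looks like
    "source:rsID;source:rsID"; only entries whose ID starts with "rs"
    are considered.
    """
    ids = [entry.split(":")[-1].strip() for entry in xref.split(";")]
    ids = [snp_id for snp_id in ids if snp_id.startswith("rs")]
    if not ids:
        # Empty or unusable xRef: fall back to a position-based name.
        return "SNP-%s-%s" % (chromosome, position)
    # use_first mirrors the -f switch; the default is the last ID listed.
    return ids[0] if use_first else ids[-1]


def format_tped_line(chromosome, snp_id, position, allele1, allele2):
    """Build one TPED line: chromosome, SNP ID, genetic distance (0),
    base-pair position, then the two alleles."""
    return "%s %s 0 %s %s %s" % (chromosome, snp_id, position, allele1, allele2)
```

With a hypothetical xRef value such as "dbsnp.100:rs2531266;dbsnp.129:rs55874132" (the build numbers here are made up for illustration), the default call returns rs55874132 and use_first=True returns rs2531266, matching the output lines shown above.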
This script requires Python 2.4 or higher. Usage example:
python cgivariants2plink.py -v var-GS000001234-ASM.tsv.bz2 -o var-GS000001234-ASM
Download the tool here.