This notebook will include example for how to:
- Open ascii files manually
- Open ascii files using the Pandas module
- Plot the data which has been parsed
- Save the results of those plots# The first step when using python is to import the packages which you want to use.
# is is good practice to import all the packages you want to use in one place (as opposed to throught the file)
# Packages are bits of codes other have written and you have installed which you can call from your programs
# this is super helpful as it means you don't have to write all this code again
# Some particularly helpful packages are:
#   - numpy for more powerful mathematics operations such as linear algebra and statistics
#   - matplotlib for graphing and visualizaton
#   - scipy for some generally helpful scientific tools such as curve fitting
#   - pandas for big data handeling
#   - astropy for tools specific to astornomy such as time corrections and sky coordinates
# The syntax to bring packages into python can be somewhat confusing, I'll try to break down the basic rules below
# import statments can either start with "import" or "from"
# statements are basically human readable
# the "as" keyword is like assigning a shortcut name to a thing
# so for example you might say
# import numpy as np
# this will bring in all the code from the numpy package to your script. You can then access the code by writing np.<function_name> (for example np.sin(2*np.pi) would return the sin of 2 pi)
# if you didn't have as and instead just wrote
# import numpy
# you would instead write numpy.sin(2*numpy.py)
# you can also import specific things from a package
# from numpy import pi, sin
# in this case you can simply write sin(2*pi) but you wont have access to any of the other things within the package (np.<function_name> won't work because you never defined what np is)
# sometimes packages have subpackages within them
# import numpy.random as rand
# here the subpackage is called random, you can then access anything within random by doing rand.<function_name> (such as rand.normal() for a normal distribution)
# similarly to above you could also just grab one part
# from numpy.random import normal as norm
# this would then let you simply call norm()
# note how you can combine / mix and match the import, from, and as keywords
# Below I will import a few important packages in the way that I most commonly use them (including their most common / standard abbriviations)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from scipy.stats import gaussian_kde
from matplotlib.colors import LogNorm
FILEPATH = "hlsp_acsggct_hst_acs-wfc_ngc2808_r.rdviq.cal.adj.zpt"
# As a first check lets run a line of code that will make sure the file exists where we think it does!
# If the file exists then this will output okay, otherwise it will raise an error
# don't worry about the details of how the error is raised right now
if not os.path.exists(FILEPATH):
    raise OSError(f"file {os.path.basename(FILEPATH)} does not exist!")
else:
    print("Okay!")
Okay!
Now we want to be able to read in the contents of the file and do things with them. We can do this "manually" or we can use some code others have written to. While generally I just use pandas (the code in question that others have writte), there is a lot of value in being able to do this yourself. That's because oftentimes files are quite complicated and the prebuilt code simply does not know what to do with them
To read in a file we have to go through three steps
1) open the file
2) copy the contents from the file into the programs memory
3) close the file
in basic that might look like the following
file = open(FILEPATH)
file_contents = file.read()
file.close()
# because the file is really large it would take a long time to print all of it so this just prints the first 100 charectars to the screen, just to make sure this worked
print(file_contents[:100])
id x y Vvega err VIvega err Ivega err Vground Iground Nv Ni wV wI xsig ysig othv othi qfitV qfitI R
There is actually a slightly better easier way to do this
with open(FILEPATH) as file:
    file_contents = file.read()
    
print(file_contents[:100])
id x y Vvega err VIvega err Ivega err Vground Iground Nv Ni wV wI xsig ysig othv othi qfitV qfitI R
Here the "with" block (called a context manager) took care of opening and closing the file for us (this is good cause it can be hard to remember to close a file so its nice to have the program handel it for you). When you need to open a file always try to use a context manager.
We now have the contents of the file read into memory (put another way we have stored them in the variable called file_contents). We now need to "parse" them. In this case we need to take this raw array of charectars and turn it into some structured table of numbers. When you are writing code to parse the contents of a file the first step is always to open the file in a text editor and look at its structure. Then you look for patterns which you can exploit to pull out the data that you want. Below is a snippet from the file we are interested in parsing
id x y Vvega err VIvega err Ivega err Vground Iground Nv  Ni wV wI xsig ysig othv othi qfitV qfitI RA Dec               
     1 1500.801  975.634   19.905   0.0029    0.634   0.0043   19.271   0.0032   20.077   19.230  1  1  1  1  0.014  0.010  0.000  0.000  0.059  0.024   138.0592168   -64.8911376
     2 1532.085  861.844   21.514   0.0061    0.917   0.0085   20.597   0.0059   21.783   20.563  1  1  1  1  0.022  0.001  0.121  0.009  0.067  0.152   138.0581958   -64.8927183
     3 1530.230  872.678   19.653   0.0026    0.600   0.0039   19.053   0.0029   19.811   19.011  1  1  1  1  0.016  0.009  0.002  0.002  0.054  0.027   138.0582561   -64.8925679
     4 1523.191  888.185   23.043   0.0125    1.028   0.0170   22.015   0.0115   23.347   21.984  1  1  1  1  0.056  0.040  0.000  0.000  0.116  0.075   138.0584862   -64.8923524
     5 1527.557  920.728   22.079   0.0080    0.853   0.0113   21.226   0.0080   22.328   21.189  1  1  1  1  0.005  0.001  0.000  0.000  0.072  0.048   138.0583423   -64.8919005
     6 1533.630  935.460   23.917   0.0189    1.475   0.0236   22.442   0.0141   24.339   22.430  1  1  1  1  0.032  0.021  0.357  0.168  0.074  0.194   138.0581433   -64.8916959
     7 1516.235  948.472   24.170   0.0215    0.935   0.0298   23.235   0.0207   24.445   23.201  1  1  1  1  0.043  0.006  0.023  0.033  0.181  0.218   138.0587125   -64.8915150
     8 1525.199  951.999   21.847   0.0072    0.780   0.0103   21.067   0.0074   22.071   21.029  1  1  1  1  0.010  0.001  0.006  0.006  0.067  0.058   138.0584188   -64.8914661
     9 1533.446  942.922   19.697   0.0026    0.619   0.0040   19.078   0.0029   19.864   19.036  1  1  1  1  0.006  0.008  0.000  0.000  0.051  0.032   138.0581490   -64.8915923
    10 1521.701  965.747   20.131   0.0032    0.620   0.0048   19.511   0.0036   20.298   19.469  1  1  1  1  0.011  0.007  0.000  0.000  0.055  0.035   138.0585330   -64.8912751
    11 1524.800  975.675   24.433   0.0245    1.345   0.0312   23.088   0.0194   24.824   23.070  1  1  1  1  0.041  0.054  0.174  0.101  0.199  0.238   138.0584313   -64.8911373
    12 1532.479  983.762   20.286   0.0035    0.649   0.0052   19.637   0.0038   20.464   19.596  1  1  1  1  0.008  0.013  0.002  0.003  0.049  0.038   138.0581797   -64.8910250
    13 1554.750  796.796   21.106   0.0051    0.674   0.0075   20.432   0.0055   21.292   20.391  1  1  1  1  0.022  0.045  0.000  0.000  0.193  0.160   138.0574554   -64.8936220
    14 1560.846  815.714   21.803   0.0070    0.826   0.0100   20.977   0.0071   22.044   20.939  1  1  1  1  0.011  0.010  0.000  0.000  0.059  0.053   138.0572552   -64.8933593
    15 1549.266  849.083   18.898   0.0018    0.691   0.0027   18.207   0.0020   19.091   18.167  1  1  1  1  0.004  0.012  0.000  0.000  0.051  0.034   138.0576335   -64.8928958
    16 1541.173  853.718   23.906   0.0188    1.344   0.0239   22.562   0.0148   24.296   22.543  1  1  1  1  0.016  0.048  0.351  0.179  0.214  0.148   138.0578985   -64.8928313
    17 1552.946  842.371   23.734   0.0173    1.345   0.0221   22.389   0.0137   24.125   22.371  1  1  1  1  0.036  0.002  0.640  0.381  0.125  0.110   138.0575132   -64.8929890
    18 1539.413  877.985   20.310   0.0035    0.646   0.0052   19.664   0.0039   20.486   19.623  1  1  1  1  0.008  0.008  0.016  0.027  0.059  0.032   138.0579555   -64.8924942
    19 1550.015  887.235   24.402   0.0239    1.393   0.0301   23.009   0.0183   24.804   22.993  1  1  1  1  0.025  0.021  0.004  0.003  0.145  0.117   138.0576080   -64.8923659
    20 1540.053  873.743   22.633   0.0103    0.957   0.0142   21.676   0.0098   22.916   21.642  1  1  1  1  0.011  0.005  0.840  0.573  0.106  0.068   138.0579347   -64.8925532
    21 1558.787  885.288   24.248   0.0221    1.269   0.0286   22.979   0.0181   24.620   22.957  1  1  1  1  0.033  0.045  0.004  0.006  0.177  0.107   138.0573210   -64.8923930
    22 1543.851  901.133   21.903   0.0073    0.784   0.0105   21.119   0.0076   22.128   21.081  1  1  1  1  0.013  0.007  0.000  0.000  0.047  0.035   138.0578096   -64.8921728
    23 1561.731  909.985   24.552   0.0257    1.391   0.0324   23.161   0.0197   24.954   23.145  1  1  1  1  0.008  0.035  0.177  0.045  0.204  0.124   138.0572241   -64.8920500
    24 1544.442  935.553   24.960   0.0314    1.664   0.0378   23.296   0.0210   25.422   23.294  1  1  1  1  0.008  0.043  0.006  0.006  0.211  0.228   138.0577894   -64.8916947
    25 1549.343  962.155   22.845   0.0114    1.026   0.0155   21.819   0.0105   23.149   21.788  1  1  1  1  0.013  0.015  0.000  0.000  0.078  0.059   138.0576284   -64.8913253
    26 1556.440  947.695   23.489   0.0154    1.057   0.0208   22.432   0.0140   23.801   22.402  1  1  1  1  0.013  0.030  0.000  0.000  0.135  0.077   138.0573964   -64.8915262
    27 1539.592  985.276   21.826   0.0071    0.843   0.0100   20.983   0.0071   22.071   20.946  1  1  1  1  0.004  0.007  0.041  0.040  0.061  0.052   138.0579470   -64.8910041
    28 1562.309  978.534   25.934   0.2530    1.354   0.3379   24.580   0.2240   26.326   24.562  2  2  1  1  0.001  0.116  0.387  0.294  0.488  0.393   138.0572035   -64.8910980
    29 1575.370  735.744   23.939   0.0191    1.212   0.0249   22.727   0.0160   24.295   22.702  1  1  1  1  0.023  0.007  0.017  0.013  0.118  0.077   138.0567819   -64.8944702
    30 1586.542  715.469   20.932   0.0047    0.678   0.0069   20.254   0.0051   21.120   20.214  1  1  1  1  0.005  0.021  0.000  0.000  0.059  0.038   138.0564167   -64.8947519
    31 1579.079  748.357   23.392   0.0147    1.197   0.0193   22.195   0.0125   23.744   22.170  1  1  1  1  0.013  0.014  0.013  0.009  0.064  0.085   138.0566601   -64.8942950
    32 1587.393  738.984   20.093   0.0032    0.627   0.0047   19.466   0.0035   20.262   19.424  1  1  1  1  0.003  0.016  0.000  0.000  0.059  0.041   138.0563883   -64.8944253
    33 1582.392  784.999   21.064   0.0050    0.785   0.0072   20.279   0.0051   21.291   20.241  1  1  1  1  0.006  0.008  0.001  0.003  0.054  0.037   138.0565509   -64.8937862
    34 1584.768  779.082   23.791   0.0178    1.193   0.0234   22.598   0.0151   24.142   22.573  1  1  1  1  0.021  0.050  0.201  0.123  0.117  0.082   138.0564731   -64.8938684
    35 1584.765  801.290   20.786   0.0044    0.658   0.0065   20.128   0.0048   20.967   20.088  1  1  1  1  0.008  0.011  0.000  0.000  0.054  0.036   138.0564726   -64.8935599
    36 1587.570  829.191   20.571   0.0040    0.675   0.0059   19.896   0.0043   20.759   19.856  1  1  1  1  0.012  0.001  0.000  0.000  0.060  0.036   138.0563803   -64.8931724
    37 1568.598  870.703   23.880   0.0185    1.216   0.0242   22.664   0.0155   24.238   22.640  1  1  1  1  0.034  0.002  0.040  0.039  0.153  0.080   138.0570002   -64.8925957
    38 1581.094  877.097   23.736   0.0173    1.135   0.0230   22.601   0.0151   24.071   22.573  1  1  1  1  0.007  0.020  0.000  0.000  0.137  0.074   138.0565912   -64.8925070
    39 1568.746  899.814   19.070   0.0020    0.663   0.0029   18.407   0.0022   19.253   18.366  1  1  1  1  0.013  0.008  0.000  0.000  0.061  0.035   138.0569946   -64.8921914
    40 1578.019  906.725   23.875   0.0185    1.310   0.0237   22.565   0.0148   24.257   22.545  1  1  1  1  0.035  0.048  0.076  0.070  0.202  0.238   138.0566910   -64.8920955
    41 1585.284  892.127   21.174   0.0052    1.575   0.0064   19.599   0.0037   21.618   19.592  1  1  1  1  0.014  0.011  0.001  0.001  0.056  0.030   138.0564537   -64.8922983
    42 1586.016  903.027   26.981   0.0958    2.180   0.1054   24.801   0.0441   27.533   24.832  1  1  1  1  0.019  0.046  9.900  0.189  0.692  0.474   138.0564293   -64.8921469
    43 1580.735  896.629   24.186   0.0214    1.258   0.0277   22.928   0.0176   24.554   22.906  1  1  1  1  0.050  0.069  0.209  0.316  0.106  0.154   138.0566026   -64.8922357
    44 1568.822  920.474   22.433   0.0094    0.987   0.0129   21.446   0.0088   22.725   21.414  1  1  1  1  0.009  0.014  0.000  0.000  0.066  0.082   138.0569918   -64.8919044
    45 1583.164  929.258   25.278   0.3970    1.472   0.3979   23.806   0.0270   25.699   23.793  2  1  1  1  0.048  0.049  0.019  0.013  0.306  0.141   138.0565222   -64.8917826
    46 1574.176  936.517   26.688   0.6730    1.588   0.6749   25.100   0.0510   27.135   25.094  2  2  1  1  0.064  0.110  4.400  2.114  0.880  0.494   138.0568160   -64.8916817
    47 1563.259  957.949   26.519   0.0725    1.926   0.0827   24.593   0.0398   27.030   24.607  1  1  1  1  0.030  0.057  0.331  0.137  0.551  0.320   138.0571729   -64.8913839
We can see that this file has the format of
1) A header line with column names seperated (more formalled called delimited) by spaces
2) 2 blank lines
3) Data lines delimited by spaces and intended by a tab when compared to the header
This is all the information we need to write a simple parser. Recall that the variable file_header stores the contents of the file.
Let's start by extracting the header
# split file_contents into an array of strings based on lines. Each line becomes a new entry into this array. the \n charectar is the UNIX new line charectar.
# This will work on any "UNIX-Like" system. This includes Mac OS and Linux. This does not always work on windows which sometimes uses a different line feed charectar.
lines = file_contents.split('\n')
# get the first (0th because python counts from 0) line, which we already know is the header line
header_row = lines[0]
# We now have the first row but really we want that to be a list of the column names. We can use the split() function to split a string into a list of strings based on whitespace
# every white space charectar will be removed and the strings between them will be added to a new list (in this case called header)
header = header_row.split()
print(header)
['id', 'x', 'y', 'Vvega', 'err', 'VIvega', 'err', 'Ivega', 'err', 'Vground', 'Iground', 'Nv', 'Ni', 'wV', 'wI', 'xsig', 'ysig', 'othv', 'othi', 'qfitV', 'qfitI', 'RA', 'Dec']
Now we have a list of the column names. Next we will extract the data
# Here we get all the lines after (but not including) the third line up do (but not including) the final line (thats what the n:m sytnax does, start at n non inclusive and get everything after it up until m)
# negative numbers count back from the end so -1 is the last line
# we want to drop the last line because if you open the file and look at it you will see the last line is blank with no data on it
data_list = lines[3:-1]
# data_list is now a list of lines, we want to split each list into another list of numbers. There are a few ways you could do this;
# however, one of the best is with something called "list-comprehension"
# This lets you iterate over elements of a list and preform an operation on each one one at a time
# So below we say that for each row (x) in data_list (each row in the file) split that row in the same way we split
# the column names from before. We will be left with a 2D list (a table)
data_seperated = [x.split() for x in data_list]
# One major problem remains however. The data in data_seperated are in the form of strings
# that is to say that while they look like numbers, the computer knows them only as the charectars that represent those
# numbers. It does not know they are actually numbers and can therefore not do any math with them.
# Most languages (Python included) have whats called "casting" where you can take the string representation of a number
# and turn it into the numeric representation (tell the computer its a number and not just a set of charectars)
# next we will cast all the elements of data_seperated to "floats" (decimal numbers)
# recall that data_seperated is a 2D list so we need to loop over 2 indicies. Once again list comprehension makes
# this quite straigtforward
# Here we say for every item (y) in every row (x) in data_seperated turn y into a float
data_cast = [[float(y) for y in x] for x in data_seperated]
Now we have a numeric representation of the data from the file which we opened. The final thing we want to be able to do is to connect these with the header names from before. There is a data structure in python called a dictionary (more generally this is a hash-map) which allows you to index data with strings. We will use the column name as the index and the column as the data to index
# This is a lot like list comprehension, this is dictoinary comprehension (note the {} as opposed to []). 
# Dictionaries are key-value paried data-structures, the key is defined on the left side of the :
# The value assigned to that key is on the right side of the :
# Here we define a dictionary called parsed
# We first loop over the "enumeration" of header
# this means loop over every element of header (recall these will be the column names) but don't just return the value, also count up from 0 (in other words return the place that value is in the list).
# The enumeration returns the index then the value, so when we say for column_number, column_name in enumerate(header) each loop iteration the index will
# unpack into column_numebr and the value at that index will unpack into column_name
# we use the column name to define the key for that element in the dict (the left side of the :) and we use the column number to select the correct element from each row
# in the data table we generated in the previous cell
parsed = {column_name: [row[column_number] for row in data_cast] for column_number, column_name in enumerate(header)}
We now have fully parsed this file, you can look at parts of it using the header names, for example if we wanted to plot (spoilers for latter) the x and y positions of the first 100 of these targets we could do the following
x = parsed['x'][:100]
y = parsed['y'][:100]
plt.plot(x, y, '.')
[<matplotlib.lines.Line2D at 0x7fc5f81a73c8>]
Now that's all well and good and once you get used to it it doesn't take too long to write; however, we don't normally have a file which is so spcialised we need to write our own parsed. Normally you can use a pre-canned one. The python package pandas has the ability to parse many files built right in
# All of what we did before (and a good bit more behind the scenes housekeeping) is handeled in this one line
# Read_csv is the function to read an ascii file
# start by giving it the filepath
# then we need to tell it how the file is delimited. By default it will assume commas seperate the data
# here however data is seperated by spaces. We can use whats called a regular expression to generally
# to account for seperation by all whitespace (spaces, tabs, a few other esotaric charectars)
# finally we tell pandas to skip blank lines because we know we have 2 blank lines between the header and the data
parsed = pd.read_csv(FILEPATH, sep=r"\s+", skip_blank_lines=True)
# one of the nice things pandas does is give us this easy to read table output when we look at our data
parsed
| id | x | y | Vvega | err | VIvega | err.1 | Ivega | err.2 | Vground | ... | wV | wI | xsig | ysig | othv | othi | qfitV | qfitI | RA | Dec | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1500.801 | 975.634 | 19.905 | 0.0029 | 0.634 | 0.0043 | 19.271 | 0.0032 | 20.077 | ... | 1 | 1 | 0.014 | 0.010 | 0.000 | 0.000 | 0.059 | 0.024 | 138.059217 | -64.891138 | 
| 1 | 2 | 1532.085 | 861.844 | 21.514 | 0.0061 | 0.917 | 0.0085 | 20.597 | 0.0059 | 21.783 | ... | 1 | 1 | 0.022 | 0.001 | 0.121 | 0.009 | 0.067 | 0.152 | 138.058196 | -64.892718 | 
| 2 | 3 | 1530.230 | 872.678 | 19.653 | 0.0026 | 0.600 | 0.0039 | 19.053 | 0.0029 | 19.811 | ... | 1 | 1 | 0.016 | 0.009 | 0.002 | 0.002 | 0.054 | 0.027 | 138.058256 | -64.892568 | 
| 3 | 4 | 1523.191 | 888.185 | 23.043 | 0.0125 | 1.028 | 0.0170 | 22.015 | 0.0115 | 23.347 | ... | 1 | 1 | 0.056 | 0.040 | 0.000 | 0.000 | 0.116 | 0.075 | 138.058486 | -64.892352 | 
| 4 | 5 | 1527.557 | 920.728 | 22.079 | 0.0080 | 0.853 | 0.0113 | 21.226 | 0.0080 | 22.328 | ... | 1 | 1 | 0.005 | 0.001 | 0.000 | 0.000 | 0.072 | 0.048 | 138.058342 | -64.891901 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 308728 | 308729 | 4442.749 | 5098.706 | 25.143 | 0.0340 | 1.483 | 0.0423 | 23.660 | 0.0252 | 25.567 | ... | 1 | 1 | 0.014 | 0.031 | 0.002 | 0.001 | 0.183 | 0.167 | 137.963025 | -64.833873 | 
| 308729 | 308730 | 4471.442 | 5010.146 | 22.015 | 0.0079 | 0.864 | 0.0111 | 21.151 | 0.0078 | 22.268 | ... | 1 | 1 | 0.009 | 0.001 | 0.000 | 0.000 | 0.065 | 0.034 | 137.962086 | -64.835103 | 
| 308730 | 308731 | 4484.483 | 4999.601 | 21.080 | 0.0051 | 0.765 | 0.0074 | 20.315 | 0.0053 | 21.299 | ... | 1 | 1 | 0.010 | 0.002 | 0.000 | 0.000 | 0.042 | 0.029 | 137.961660 | -64.835249 | 
| 308731 | 308732 | 4464.859 | 5026.023 | 22.737 | 0.0110 | 1.009 | 0.0150 | 21.728 | 0.0102 | 23.035 | ... | 1 | 1 | 0.008 | 0.007 | 0.146 | 0.115 | 0.075 | 0.045 | 137.962302 | -64.834883 | 
| 308732 | 308733 | 4470.596 | 5047.705 | 25.217 | 0.0355 | 1.328 | 0.0453 | 23.889 | 0.0282 | 25.603 | ... | 1 | 1 | 0.040 | 0.079 | 0.000 | 0.000 | 0.228 | 0.110 | 137.962115 | -64.834581 | 
308733 rows × 23 columns
# And this looks the same as before which is as it should be!
plt.plot(parsed['x'].iloc[:100], parsed['y'].iloc[:100], '.')
[<matplotlib.lines.Line2D at 0x7fc670f5d6a0>]
Congratulations! You have now read in and parsed your first file. This is a suprisingly large part of astronomy. Files can get pretty complicated so having this basic understanding is really helpful!
Now that we have the data loaded into some usable data structure we can think about visualizing it.
Python has many viszlization libraries; however, by far the most commonly used is matplotlib. Matplotlib lets you create any number of graphs in 2 and 3D.
The first thing that we are going to look at is the color-magnitude diagram. In astronomy when we say color we mean the difference in brightness between two "filters" or two very specific wavelength of light. So we might imagine some telescope with a filter called B and filter called G. In this example think of B as the blue filter, only letting through blue light and blocking everything else, and G and the green filter, only letting through green light and blocking everything else. If we take 2 pictures of a star one with the B filter in place and one with the G filter in place we could report the stars Bmag and Gmag, its brightness (or magnitude) in blue light and its magnitude in green light (magnitude has a slightly more subtle definition but brightness is okay to 1st order). We could then say that its B-G color is the difference between the brightness in the blue filter and the green filter. It turns out a lot of interesting structure exists in graphs of magnitude vs color, these are called color-magnitude diagrams (CMDs)
The Hubble Space Telescope (HST) has a variety of filters, they are called things like F606W, F814W, and others in that kinda mold. Don't worry about what the names mean exactly right now (if you are interested: in general they give the center wavelength of the filter and then denote how much space around that center wavelength they cover)
We are going to look at the F606W F814W CMD. In the dataset which we have parsed these have been given the column names Vvega and Ivega.
# Extraxt these from the parsed dataframe, this is not strictly nessisairy; however, often in programming you do things to keep code clean even if it not needed.
# by extracting these here I reduce the number of times I need to write parsed['Vvega/Ivega'] down the line. Simply writing Vvega/Ivega is a bit cleaner and
# will make the code easeier to read down the line.
Vvega = parsed['Vvega']
Ivega = parsed['Ivega']
# Let's plot these now
# '.' tells matplotlib to plot small points (as opposed to 'o' with is large points, or not having anything there in which case it will attempt to connect all points with lines)
plt.plot(Vvega-Ivega, Vvega, '.')
plt.xlabel('F606W-F814W')
plt.ylabel('F606W')
Text(0, 0.5, 'F606W')
So we can already see some structure here; however, there are a few things we can do to make this both more readable and more in line with normal CMDs (how astronomers tend to plot them for some historical reasons)
First of all when we plot CMDs we invert the y-axis (bigger numbers on the bottom). This is because the larger the magnitude the fainter the star. So by flipping the y-axis we keep the brightest objects on top (lowest magnitudes). In order to flip the axis we are going to dive a little more deeply into matplotlib.
Matplotlib operates on a few basic structures. The canvas, the figure, and the axes. In general you don't need to worry about the canvas, thats basically handeled by the backend. The figure only takes a little configureation (this is where you will set the overall size of the figure for example). The axes, which is everything you actually plot, the data, the axis labels, the tick marks, everything you see, is where most of the customization work goes.
Up and until now you have seen graphs made using something like the following:
plt.plot(x, y)
This is great for quick and dirty graphs; however, it limits what you can do latter on. For more powerful and customized plotting we need to use the "matplotlib object model". In general that looks like
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
ax.plot(x, y)
# Lets take a look at this in more detail
# subplots let you make multiple plots in one line, here we only make one so we ask for 1 row and 1 column (thats what the 1, 1 is for)
# the figure (which holds the axes) and the axes (which you plot to) are both returned from this
# finally we set the figure size to 10, 7. These size units are basically arbitray so find ones that work for you
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
ax.plot(Vvega-Ivega, Vvega, '.')
ax.set_xlabel('F606W-F814W')
ax.set_ylabel('F606W')
Text(0, 0.5, 'F606W')
You can see this looks the same as before, except now it is larger. Let's now invert the yaxis. The font size on the x and y axis labels also look quite small, lets make those larger.
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
ax.plot(Vvega-Ivega, Vvega, '.')
# There are many helpful functions within the axes to customize them
ax.axes.invert_yaxis()
ax.set_xlabel('F606W-F814W', fontsize=25)
ax.set_ylabel('F606W', fontsize=25)
Text(0, 0.5, 'F606W')
Now this is much more like a CMD as we would see it in astronomy. However, there are a lot of points here, so many in fact that we cant actually see a lot of the structure. Basically points in the center of the CMD are rendering on top of each other hiding small differences between their location. This is actually somewhat of a challenging problem to solve. Once way is to simply make the points smaller, but this does not always work. What we will do here is make a mathematical representation of the density of these points then plot that. This is called a density map.
To generate the density map we are going to use something called a "gaussian kernel density estimator". The details of how this works are not super important, if you want to read about it here is a good resource https://mathisonian.github.io/kde/. This is a good example how useful python is for glueing different pieces of code together. Note that because of how many points are plotted here this may take quite a while to run
color = Vvega-Ivega
# here we use guassian_kde to create a function we can evaluate at any point in our domain and return the estimate density at that point
# therefore kde_f is a function which takes a coordinate as argument (such as kde_f((0,0)) and returns a density estimate
xy = np.vstack([color, Vvega])
kde_f = gaussian_kde(xy)
# We could evaluate the density at every point we have data for
# However....Thats a lot of points and will take a long time to run
# A faster way is to build up a "mesh". This is a grid of points covering some 
# parameter space. Here we will build a mesh covering the color and Vvega range from each of their min values to each of their max values.
# We do some numpy magic, dont worry about this for now, just know for future that stuff like this is sometimes needed
X, Y = np.mgrid[color.min():color.max():100j, Vvega.min():Vvega.max():100j]
positions = np.vstack([X.ravel(), Y.ravel()])
# We then evaluate the kde_f function and reshape it to be an "image" (2D data / matrix)
Z = np.reshape(kde_f(positions).T, X.shape)
# then we plot that image similar to how we did before. Note how for a 2D data set we use the function imshow
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
ax.invert_yaxis()
# extent here tells matplotlib what numeric range the image data runs over
ax.imshow(np.rot90(Z), extent=[color.min(), color.max(), Vvega.min(), Vvega.max()], norm=LogNorm())
<matplotlib.image.AxesImage at 0x7fc5f016a9e8>
Note how you can see very similar structure to before but it is much more clear that there is a very narrow high density region right in the center (this is the principal sequence of NGC 2808) which rapidly falls off. Previously this density gradient was obscured by all the points we were plotting.
The final thing we are going to do in this notebook is to save these results to disk. Below is a bit of code that plots everything and saves everything
N.B. You may see some divide by zero errors here, thats okay ignore them for now.
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
img = ax.imshow(np.rot90(Z/Z.max()), extent=[color.min(), color.max(), Vvega.min(), Vvega.max()], norm=LogNorm())
ax.set_xlabel("F606W-F814W", fontsize=25)
ax.set_ylabel("F606W", fontsize=25)
# bbox_inches=tight tells matplotlib to fit the file size to the figure size, without it you will end up with a lot of whitespace in your image surrounding your file
fig.savefig("NGC2808_principal_exampleCMD.pdf", bbox_inches="tight")
Please let me know if you have any questions. I know that was a lot to have in one file. Its mostly all here so you have a reference going forward, dont think that you have to be able to understand or reproduce all of this immediatly, pretty soon you'll be old hat at this but for now take it at whatever pace you are comfortable with!