32

The specific example in mind is a list of filenames and their sizes. I can't decide whether each item in the list should be of the form {"filename": "blabla", "size": 123}, or just ("blabla", 123). A dictionary seems more logical to me because to access the size, for example, file["size"] is more explanatory than file[1]... but I don't really know for sure. Thoughts?

clb
  • As an addendum, consider [tuple unpacking](https://stackoverflow.com/questions/6967632/unpacking-extended-unpacking-and-nested-extended-unpacking) if you are worried about the readability of tuples: `fname, file_size = file`, where `file` is your tuple from above, would do away with `file[1]` and replace it with `file_size`. Of course, this relies on good documentation. – deepbrook Nov 20 '17 at 13:21
  • 2
    It depends on what data structure you are building and how you intend to access it (by filename? by index? both?). Is it just a throwaway variable/data structure, or will you possibly be adding other items or attributes besides size? Does the structure need to remember an order? Do you want to sort the list by size, or access it by position (e.g. "top-n largest/smallest files")? Depending on those, the 'best' answer could be a dict, OrderedDict, namedtuple, plain old list, or a custom class of your own. We need more context from you. – smci Nov 20 '17 at 22:26

5 Answers

79

I would use a namedtuple:

from collections import namedtuple
Filesize = namedtuple('Filesize', 'filename size')
file = Filesize(filename="blabla", size=123)

Now you can use file.size and file.filename in your program, which is IMHO the most readable form. Note that namedtuple creates immutable objects, just like tuples, and that these objects are more lightweight than dictionaries, as described here.
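
A minimal sketch of how such a namedtuple behaves in practice, reusing the Filesize definition from above. The names `name`, `size`, `mutated` and `bigger` are only illustrative and not part of the original answer:

```python
from collections import namedtuple

Filesize = namedtuple('Filesize', 'filename size')
file = Filesize(filename="blabla", size=123)

# Fields can be read by name, by index, or through unpacking.
name, size = file
same_value = file.size == file[1]

# Instances are immutable: assigning to a field raises AttributeError.
try:
    file.size = 456
    mutated = True
except AttributeError:
    mutated = False

# _replace returns a modified copy and leaves the original untouched.
bigger = file._replace(size=456)
```

Because a namedtuple is still a tuple, `bigger` compares equal to the plain tuple `("blabla", 456)`.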

Doc Brown
  • 1
    Thanks, good idea, never heard of them before today (I'm pretty novicy at Python). Question: what happens if somebody elsewhere in the code also defines the same "class", possibly slightly differently. e.g., in some other source file, coworker Bob had `Filesize = namedtuple('Filesize', 'filepath kilobytes')` – user949300 Nov 19 '17 at 19:13
  • You can also use the very nice `attrs` module (you can find it through `pip` or just search for it), which gives you syntactic conveniences very similar to a named tuple but also allows mutability (instances can be made immutable too). The main functional difference is that `attrs`-made classes don't compare equal to plain tuples, the way `namedtuple`s do. – mtraceur Nov 19 '17 at 20:28
  • 3
    @DocBrown Python has no concept of declarations. `class`, `def`, and `=` all just overwrite any previous uses. [repl.it](https://repl.it/repls/FrequentDazzlingRook) – Challenger5 Nov 19 '17 at 21:18
  • @Challenger5: you are right, my mistake, so the correct answer is: latest definition counts, no error from the Python runtime, but still similar behaviour as with any other variable. – Doc Brown Nov 19 '17 at 22:30
  • 8
    Note that `namedtuple` is essentially a shorthand declaration for a new type with immutable attributes. This means the answer is effectively, "Neither a `tuple` nor a `dict`, but an `object`." +1 – jpmc26 Nov 20 '17 at 02:36
19

{"filename": "blabla", "size": 123}, or just ("blabla", 123)

This is the age-old question of whether to encode your format / schema in-band or out-of-band.

You trade off some memory to get the readability and portability that come from expressing the format of the data right in the data. If you don't do this, the knowledge that the first field is the file name and the second is the size has to be kept elsewhere. That saves memory, but it costs readability and portability. Which is going to cost your company more money?

As for the immutable issue, remember immutable doesn't mean useless in the face of change. It means we need to grab more memory, make the change in a copy, and use the new copy. That's not free but it's often not a deal breaker. We use immutable strings for changing things all the time.
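
As a small illustration of "make the change in a copy", here is a sketch with a string and a plain tuple. The variable names and the file name are made up for this example:

```python
# Strings are immutable: replace() builds a new string.
old_name = "blabla.txt"
new_name = old_name.replace(".txt", ".bak")

# Tuples are immutable too: "changing" the size means building a new tuple.
record = ("blabla.txt", 123)
updated = (record[0], 456)
```

In both cases the original object (`old_name`, `record`) is left unchanged and the new value lives in a separate object.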

Another consideration is extensibility. When you store data only positionally, without encoding format information, then you're condemned to only single inheritance, which really is nothing but the practice of concatenating additional fields after the established fields. I can define a 3rd field to be the creation date and still be compatible with your format since I define first and second the same way.

However, what I can't do is bring together two independently defined formats that have some overlapping fields and some not, store them in one format, and have it be useful to things that only know about one format or the other.

To do that I need to encode the format info from the beginning. I need to say "this field is the filename". Doing that allows for multiple inheritance.

You're probably used to inheritance only being expressed in the context of objects but the same ideas work for data formats because, well, objects are stored in data formats. It's exactly the same problem.
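
A rough sketch of that idea with dictionaries, assuming two hypothetical formats that share only the "filename" field. The `mime_type` field and both `describe_*` functions are invented for illustration:

```python
# Two independently defined formats that overlap on "filename".
size_record = {"filename": "blabla", "size": 123}
type_record = {"filename": "blabla", "mime_type": "text/plain"}

# Because every field carries its name, both can be merged into one record.
merged = {**size_record, **type_record}


def describe_size(record):
    """Hypothetical consumer that only knows the size format."""
    return f'{record["filename"]}: {record["size"]} bytes'


def describe_type(record):
    """Hypothetical consumer that only knows the MIME type format."""
    return f'{record["filename"]} is {record["mime_type"]}'
```

Both consumers can work with `merged` without knowing about the other format's fields, which is not possible when the meaning of a field depends only on its position.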

So use whichever you think you're most likely to need. I reach for flexibility unless I can point to a good reason not to.

candied_orange
  • 3
    To be honest, I doubt anyone who's unsure between using an in-band or out-of-band format has such tight performance requirements that they would need to use an out-of-band format – Alexander Nov 19 '17 at 20:50
  • 2
    @Alexander very true. I prefer to teach people about it so they understand what they're looking at when confronted with out-of-band solutions. Binary formats often do this for obfuscation reasons. Not everyone wants to be portable. As for performance reasons, if it really matters consider compression before resorting to out-of-band. – candied_orange Nov 19 '17 at 21:19
  • Remember that OP is using Python, so they're probably not too concerned about performance. Most high-level code should be written with readability in mind first; premature optimization is the root of all evil. – Dagrooms Nov 21 '17 at 22:26
  • @Dagrooms don't be hating on Python. It performs well in many cases. But otherwise I agree with everything you said. My point was to say "This is why people do that. Here's why you likely don't care". – candied_orange Nov 21 '17 at 23:40
  • 1
    @CandiedOrange I'm not hating the language, I use it in my daily work. I dislike the way people use it. – Dagrooms Nov 22 '17 at 00:06
7

I would use a class with two properties. file.size is nicer than either file[1] or file["size"].

Simple is better than complex.
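
One possible shape for such a class, as a sketch only. The class name `FileInfo` is an assumption, since the answer does not name one:

```python
class FileInfo:
    """Hypothetical class holding a file name and its size."""

    def __init__(self, filename, size):
        self.filename = filename
        self.size = size

    def __repr__(self):
        return f"FileInfo(filename={self.filename!r}, size={self.size!r})"


file = FileInfo("blabla", 123)
```

With this, `file.size` and `file.filename` read naturally, and unlike a tuple the attributes can be reassigned later if needed.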

JacquesB
  • In case someone is wondering: for generating JSON, both work equally well. ``file = Filesize(filename='stuff.txt', size=222)`` and ``filetup = ("stuff.txt", 222)`` produce the same output: ``json.dumps(file)`` and ``json.dumps(filetup)`` both result in ``'["stuff.txt", 222]'``. – Juha Untinen Nov 21 '17 at 07:09
5

Are the filenames unique? If so, you could scrap the list entirely and just use a pure dictionary for all the files. e.g. (a hypothetical website)

{ 
  "/index.html" : 5467,
  "/about.html" : 3425,
  "/css/main.css" : 9876
}

etc...

Now, you don't get "name" and "size", you just use key and value, but often this is more natural. YMMV.

If you really want a "size" for clarity, or you need more than one value for the file, then:

{
   "/index.html" : { "size": 5467, "mime_type" : "foo" },
   "/about.html" : { "size": 3425, "mime_type" : "foo" },
   "/css/main.css" : { "size": 9876, "mime_type" : "bar" }
}
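
A short sketch of how the simple name-to-size dictionary might be used, with the sizes from the hypothetical website above. The variable names are illustrative:

```python
sizes = {
    "/index.html": 5467,
    "/about.html": 3425,
    "/css/main.css": 9876,
}

# Direct lookup by filename.
index_size = sizes["/index.html"]

# Files ordered from largest to smallest, as (name, size) pairs.
largest_first = sorted(sizes.items(), key=lambda item: item[1], reverse=True)

# Total size of all files.
total = sum(sizes.values())
```

Lookup by name is a single key access here, whereas a list of tuples or dicts would have to be searched.
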
user949300
0

In Python, a dictionary is a mutable object, whereas a tuple is an immutable object.

If you need to change key/value pairs often, I suggest using a dictionary.

If you have fixed/static data, I suggest using a tuple.

# Define a dictionary.
a = {}
a['test'] = 'first value'

# Define a tuple.
b = ()
b = b + (1,)  # builds a new tuple; the old one is not modified in place

# Here, we can change the dictionary value for key 'test'.
a['test'] = 'second'

But we cannot change the data in a tuple using the assignment operator.
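
A minimal sketch of what happens when you try, using a tuple like the one above (the `changed` flag is only there to record the outcome):

```python
b = (1,)

try:
    b[0] = 2  # tuples do not support item assignment
    changed = True
except TypeError:
    changed = False

# "Changing" a tuple means rebinding the name to a newly built tuple.
b = b + (2,)
```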